Accessing values in an XML file with namespaces in Python 2.7 lxml
Question
I'm following this link to try to get the values of several tags:
Parsing XML with namespace in Python via "ElementTree"
In that link, accessing the root tag like this works without any problem:
import sys
from lxml import etree as ET
doc = ET.parse('file.xml')
namespaces_rdf = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'} # add more as needed
namespaces_dcat = {'dcat': 'http://www.w3.org/ns/dcat#'} # add more as needed
namespaces_dct = {'dct': 'http://purl.org/dc/terms/'}
print doc.findall('rdf:RDF', namespaces_rdf)
print doc.findall('dcat:Dataset', namespaces_dcat)
print doc.findall('dct:identifier', namespaces_dct)
Output:
[]
[<Element {http://www.w3.org/ns/dcat#}Dataset at 0x2269b98>]
[]
I only get access to dcat:Dataset, and I can't see how to access the value of rdf:about and, after that, dct:identifier.
Of course, once I have accessed this info, I also need to access the dcat:distribution info.
This is my example file, generated with ckanext-dcat:
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
  xmlns:dct="http://purl.org/dc/terms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcat="http://www.w3.org/ns/dcat#"
>
  <dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
    <dct:identifier>ec631628-2f46-4f17-a685-d62a37466c01</dct:identifier>
    <dct:description>FOO-Description</dct:description>
    <dct:title>FOO-title</dct:title>
    <dcat:keyword>keyword1</dcat:keyword>
    <dcat:keyword>keyword2</dcat:keyword>
    <dct:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-10-08T08:55:04.566618</dct:issued>
    <dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-06-25T11:04:10.328902</dct:modified>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
        <dct:title>FOO-title-1</dct:title>
        <dct:description>FOO-Description-1</dct:description>
        <dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f/download/myxls.xls</dcat:accessURL>
        <dct:format>XLS</dct:format>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
        <dct:format>XLS</dct:format>
        <dct:title>FOO-title-2</dct:title>
        <dct:description>FOO-Description-2</dct:description>
        <dcat:accessURL>http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f/download/myxls.xls</dcat:accessURL>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>
Any idea how to access this info? Thanks.
UPDATE: Well, I need to access rdf:about in:
<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
So, with this code, taken from another source:
for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
    # Iterate over attributes
    for attrib in node.attrib:
        print '@' + attrib + '=' + node.attrib[attrib]
I get the following output:
[<Element {http://www.w3.org/ns/dcat#}Dataset at 0x23d8ee0>]
@{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about=http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01
So, the question is:
How can I check that the attribute is about before taking its value? In other files I have several tags.
UPDATE 2: Fixed how I get the about value (Clark notation)
for node in doc.xpath('//dcat:Dataset', namespaces=namespaces):
    # Iterate over attributes
    for attrib in node.attrib:
        if attrib.endswith('about'):
            pass  # do my jobs
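As a side note on the Clark-notation check above: `endswith('about')` also matches an `about` attribute from any other namespace. A minimal sketch of a stricter lookup, assuming the full attribute name is built from the rdf namespace URI and passed to `get()`. The sample document, its example.org URL and the extra `dcat:about` attribute are made up for illustration, and the import falls back to the standard library's ElementTree (which also provides `iter()` and `get()`) if lxml is not installed.

```python
try:
    from lxml import etree as ET  # library used in the question
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback; iter() and get() exist in both

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
DCAT_NS = 'http://www.w3.org/ns/dcat#'

# Hypothetical, trimmed stand-in for the question's file. The dcat:about
# attribute is invented to show why a suffix match is too loose.
SAMPLE = '''
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://example.org/dataset/1" dcat:about="not-this-one"/>
</rdf:RDF>
'''.strip()

root = ET.fromstring(SAMPLE)

# Clark notation: '{namespace-uri}localname'
ABOUT = '{%s}about' % RDF_NS

abouts = []
for node in root.iter('{%s}Dataset' % DCAT_NS):
    # Suffix matching picks up both attributes here.
    loose = [name for name in node.attrib if name.endswith('about')]
    # get() returns None when the attribute is absent, so no loop is needed.
    value = node.get(ABOUT)
    if value is not None:
        abouts.append(value)

print(len(loose))
print(abouts)
```

With the real file, the same `node.get(ABOUT)` call can replace the inner loop over `node.attrib`.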
OK, almost finished, but I have one last question: when I access a <dct:title>, I need to know which element it belongs to:
<dcat:Dataset rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01">
<dct:title>FOO-title</dct:title>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/f5707551-6bf3-468f-9a96-b4184cc51d1f">
<dct:title>FOO-title-1</dct:title>
<dcat:Distribution rdf:about="http://www.myweb.com/dataset/ec631628-2f46-4f17-a685-d62a37466c01/resource/74c1acc8-b2b5-441b-afb2-d072d0d00a7f">
<dct:title>FOO-title-2</dct:title>
If I do something like this, I get:
for node in doc.xpath('//dct:title', namespaces=namespaces):
    print node.tag, node.text
{http://purl.org/dc/terms/}title FOO-title
{http://purl.org/dc/terms/}title FOO-title-1
{http://purl.org/dc/terms/}title FOO-title-2
Thanks
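One possible way to keep each title tied to its owner is to start from the owning elements instead of from the titles. The sketch below is only an illustration under stated assumptions: the sample is a trimmed stand-in for the file above with made-up example.org URLs, the names `NS`, `ABOUT` and `owned_titles` are hypothetical, and the import falls back to the standard library's ElementTree if lxml is missing (the calls used exist in both). A relative path without a leading `.//` only matches direct children, so `dataset.findtext('dct:title', ...)` does not pick up the distributions' titles.

```python
try:
    from lxml import etree as ET  # library used in the question
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback with the same calls

NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}
ABOUT = '{%s}about' % NS['rdf']  # Clark notation for rdf:about

# Hypothetical, trimmed stand-in for the question's file.
SAMPLE = '''
<rdf:RDF xmlns:dct="http://purl.org/dc/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://example.org/dataset/1">
    <dct:title>FOO-title</dct:title>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://example.org/dataset/1/resource/a">
        <dct:title>FOO-title-1</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
    <dcat:distribution>
      <dcat:Distribution rdf:about="http://example.org/dataset/1/resource/b">
        <dct:title>FOO-title-2</dct:title>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
</rdf:RDF>
'''.strip()

root = ET.fromstring(SAMPLE)

owned_titles = []  # (kind, rdf:about, title) tuples
for dataset in root.findall('dcat:Dataset', NS):
    # Direct child only: the dataset's own title.
    owned_titles.append(
        ('Dataset', dataset.get(ABOUT),
         dataset.findtext('dct:title', namespaces=NS)))
    # Each Distribution sits inside a dcat:distribution wrapper element.
    for dist in dataset.findall('dcat:distribution/dcat:Distribution', NS):
        owned_titles.append(
            ('Distribution', dist.get(ABOUT),
             dist.findtext('dct:title', namespaces=NS)))

for kind, about, title in owned_titles:
    print(kind, about, title)
```

With lxml specifically, calling `getparent()` on each `dct:title` node and inspecting the parent's tag is another option; the standard library's ElementTree has no equivalent method.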
Recommended answer
Use the xpath() method with the namespaces keyword argument:
namespaces = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/'
}
print(doc.xpath('//rdf:RDF', namespaces=namespaces))
print(doc.xpath('//dcat:Dataset', namespaces=namespaces))
print(doc.xpath('//dct:identifier', namespaces=namespaces))
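For background on why two of the original findall() calls returned empty lists: a path such as 'dct:identifier' is evaluated relative to the root element and only matches its direct children. rdf:RDF is the root itself and dct:identifier is a grandchild, so neither is found, while dcat:Dataset is a direct child and is. The '//' prefix in the XPath expressions above searches all descendants, and './/' does the same in findall(). A minimal sketch of that difference, assuming a trimmed stand-in document with a made-up identifier value; it sticks to calls that exist both in lxml and in the standard library's ElementTree (the fallback import), since xpath() itself is lxml-only.

```python
try:
    from lxml import etree as ET  # library used in the question
except ImportError:
    import xml.etree.ElementTree as ET  # stdlib fallback; it has no xpath() method

NS = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}

# Hypothetical, trimmed stand-in for the question's file.
SAMPLE = '''
<rdf:RDF xmlns:dct="http://purl.org/dc/terms/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#">
  <dcat:Dataset rdf:about="http://example.org/dataset/1">
    <dct:identifier>example-id</dct:identifier>
  </dcat:Dataset>
</rdf:RDF>
'''.strip()

doc = ET.ElementTree(ET.fromstring(SAMPLE))
root = doc.getroot()

# The root element is rdf:RDF itself, so it is reached with getroot().
print(root.tag)

# Relative path: direct children of the root only, so nothing matches.
direct = doc.findall('dct:identifier', NS)

# './/' searches every descendant, like '//' in the xpath() calls.
nested = doc.findall('.//dct:identifier', NS)

print(len(direct), len(nested))
print(nested[0].text)
```

With lxml installed, `doc.xpath('//dct:identifier', namespaces=NS)` should select the same element as the './/' form here.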