How do I capture all of the element names of an XML file using LXML in Python?

Question

I am able to use lxml to accomplish most of what I would like to do, although it was a struggle to go through the obfuscating examples and tutorials. In short, I am able to read an external xml file and import it via lxml into the proper tree-like format.

To demonstrate this, if I were to type:

print(etree.tostring(myXmlTree, pretty_print=True, method="xml", encoding="unicode"))

I get the following output:

<net xmlns="http://www.arin.net/whoisrws/core/v1" xmlns:ns2="http://www.arin.net/whoisrws/rdns/v1" xmlns:ns3="http://www.arin.net/whoisrws/netref/v2" termsOfUse="https://www.arin.net/whois_tou.html">
 <registrationDate>2006-08-29T00:00:00-04:00</registrationDate>
 <ref>http://whois.arin.net/rest/net/NET-79-0-0-0-1</ref>
 <endAddress>79.255.255.255</endAddress>
 <handle>NET-79-0-0-0-1</handle>
 <name>79-RIPE</name>
 <netBlocks>
  <netBlock>
   <cidrLength>8</cidrLength>
   <endAddress>79.255.255.255</endAddress>
   <description>Allocated to RIPE NCC</description>
   <type>RN</type>
   <startAddress>79.0.0.0</startAddress>
  </netBlock>
 </netBlocks>
 <orgRef name="RIPE Network Coordination Centre" handle="RIPE">http://whois.arin.net/rest/org/RIPE</orgRef>
 <comment>
  <line number="0">These addresses have been further assigned to users in</line>
  <line number="1">the RIPE NCC region. Contact information can be found in</line>
  <line number="2">the RIPE database at http://www.ripe.net/whois</line>
 </comment>
 <startAddress>79.0.0.0</startAddress>
 <updateDate>2009-05-18T07:34:02-04:00</updateDate>
 <version>4</version>
</net>

OK, that's great for human consumption, but not useful for machines. If I'd wanted particular elements, like say the start and end IP addresses in the xml, I could type:

ns = myXmlTree.nsmap[None]  # URI of the default (unprefixed) namespace
myXmlTree.findall("{" + ns + "}startAddress")[0].text
myXmlTree.findall("{" + ns + "}endAddress")[0].text

and I would get:

'79.0.0.0'
'79.255.255.255'

But I still need to LOOK at the xml file as a human to know what elements are there. Instead, I would like to be able to retrieve the names of ALL of the elements at a particular level and then automatically traverse that level. So, for instance, I'd like to do something like:

myElements = myXmlTree.findallelements("{" + ns + "}")

and it would give me a return value something like:

['registrationDate', 'ref', 'endAddress', 'handle', 'name', 'netBlocks', 'orgRef', 'comment', 'startAddress', 'updateDate', 'version']

Especially awesome would be if it could tell me the entire structure of elements, including the nested ones.

I'm SURE there's a way, as it wouldn't make sense otherwise.

Thanks in advance!

P.S., I know that I can iterate and go through the list of all iterations. I was hoping there was already a method within lxml that had these data. If iteration is the only way, I guess that's OK... it just seems clunky to me.

Recommended answer

I believe you are looking for element.xpath().

XPath is not a concept introduced by lxml; it is a general query language for selecting nodes from an XML document, and many tools that deal with XML support it. Think of it as something similar to CSS selectors, but more powerful (and a bit more complicated). See XPath Syntax.

Your document uses namespaces - I'll ignore that for now and explain at the end of the post how to deal with them, because it keeps the examples more readable that way. (But they won't work as-is for your document).

For example,

tree.xpath('/net/endAddress')

would select the <endAddress>79.255.255.255</endAddress> element directly below the <net /> node, but not the <endAddress /> inside the <netBlock>.

The XPath expression

tree.xpath('//endAddress')

would, however, select all <endAddress /> nodes anywhere in the document.
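As a minimal sketch of that difference, the snippet below runs both expressions against a cut-down, namespace-free copy of the document. The trimmed XML is my own simplification for illustration, not the original file.

```python
from lxml import etree

# Simplified, namespace-free excerpt of the document (an assumption for readability).
XML = b"""<net>
  <endAddress>79.255.255.255</endAddress>
  <netBlocks>
    <netBlock>
      <endAddress>79.255.255.255</endAddress>
      <startAddress>79.0.0.0</startAddress>
    </netBlock>
  </netBlocks>
  <startAddress>79.0.0.0</startAddress>
</net>"""

root = etree.fromstring(XML)

# Absolute path: matches only the <endAddress> that is a direct child of <net>.
direct = root.xpath('/net/endAddress')

# Descendant search: matches every <endAddress>, at any depth.
anywhere = root.xpath('//endAddress')

print(len(direct), len(anywhere))
```

The first query finds one element and the second finds two, because the second also reaches into <netBlock>.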

You can of course run further XPath expressions against the nodes you get back:

netblocks = tree.xpath('/net/netBlocks/netBlock')
for netblock in netblocks:
    start = netblock.xpath('./startAddress/text()')[0]
    end = netblock.xpath('./endAddress/text()')[0]
    print("%s - %s" % (start, end))

which would give you

79.0.0.0 - 79.255.255.255

Notice that .xpath() always returns a list of selected nodes - so if you want just one, account for that.
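One way to account for that is a small helper that returns the first match or a default. The helper name `first` is hypothetical (it is not part of lxml), and the sketch assumes the expression selects nodes or text rather than a number or boolean.

```python
from lxml import etree

root = etree.fromstring(b"<net><handle>NET-79-0-0-0-1</handle></net>")

def first(element, path, default=None):
    """Return the first XPath result, or `default` when nothing matches.

    Hypothetical helper; intended for expressions that return a list.
    """
    results = element.xpath(path)
    return results[0] if results else default

print(first(root, './handle/text()'))
print(first(root, './missing/text()'))
```

This avoids an IndexError when an element you expected is absent from the document.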

You can also select elements by their attributes:

comment = tree.xpath('/net/comment')[0]
line_2 = comment.xpath("./line[@number='2']")[0]

This would select the <line /> element with number="2" from the first comment.

You can also select attributes themselves:

numbers = tree.xpath('//line/attribute::number')

['0', '1', '2']
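For what it's worth, `attribute::number` can also be written with the XPath shorthand `@number`. Here is a small sketch on a trimmed, namespace-free <comment> block (the shortened line texts are placeholders of mine):

```python
from lxml import etree

# Trimmed stand-in for the <comment> block; the line texts are placeholders.
root = etree.fromstring(
    b"<net><comment>"
    b"<line number='0'>first</line>"
    b"<line number='1'>second</line>"
    b"<line number='2'>third</line>"
    b"</comment></net>"
)

long_form = root.xpath('//line/attribute::number')   # explicit axis
short_form = root.xpath('//line/@number')            # abbreviated syntax

# Combine an attribute filter with text() to read one specific line.
line_2_text = root.xpath("//line[@number='2']/text()")

print(long_form, short_form, line_2_text)
```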

To get the list of element names you asked about last, you could do something like this:

names = [node.tag for node in tree.xpath('/net/*')]

['registrationDate', 'ref', 'endAddress', 'handle', 'name', 'netBlocks', 'orgRef', 'comment', 'startAddress', 'updateDate', 'version']

But given the power of XPath, it's probably better to just query the document for what you want to know from it, as specific or loose as you see fit.
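If you do want the whole structure, nested elements included, one possible sketch is to walk the tree with iter() and ask the ElementTree for each element's path via getpath(). The cut-down XML below is my own simplification, again without namespaces.

```python
from lxml import etree

# Cut-down, namespace-free excerpt (an assumption for readability).
XML = b"""<net>
  <handle>NET-79-0-0-0-1</handle>
  <netBlocks>
    <netBlock>
      <type>RN</type>
    </netBlock>
  </netBlocks>
</net>"""

root = etree.fromstring(XML)
tree = root.getroottree()

# iter(etree.Element) visits elements only (no comments or processing
# instructions), in document order; getpath() returns an XPath for each one.
paths = [tree.getpath(el) for el in root.iter(etree.Element)]

for path in paths:
    print(path)
```

When several siblings share a name, getpath() adds a positional index such as [2], so each path still identifies exactly one element.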

Now, namespaces. As you noticed, if your document uses XML namespaces, you need to take that into consideration in many places, and XPath is no different. When querying a namespaced document, you pass the xpath() method the namespace map like this:

NSMAP = {'ns':  'http://www.arin.net/whoisrws/core/v1',
         'ns2': 'http://www.arin.net/whoisrws/rdns/v1',
         'ns3': 'http://www.arin.net/whoisrws/netref/v2'}

names = [node.tag for node in tree.xpath('/ns:net/*', namespaces=NSMAP)]
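One thing to be aware of: on a namespaced document, node.tag comes back in the qualified form {namespace-uri}name rather than as the bare name. A sketch of stripping the namespace with etree.QName, using a two-element excerpt of the document (the excerpt is my own trimming):

```python
from lxml import etree

XML = b"""<net xmlns="http://www.arin.net/whoisrws/core/v1">
  <handle>NET-79-0-0-0-1</handle>
  <name>79-RIPE</name>
</net>"""

root = etree.fromstring(XML)
NSMAP = {'ns': 'http://www.arin.net/whoisrws/core/v1'}

nodes = root.xpath('/ns:net/*', namespaces=NSMAP)

# Qualified tags, e.g. '{http://www.arin.net/whoisrws/core/v1}handle'
qualified = [node.tag for node in nodes]

# Local names only, with the namespace part removed.
local = [etree.QName(node).localname for node in nodes]

print(qualified)
print(local)
```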

In many other places in lxml you can specify the default namespace by using None as the dictionary key in the namespace map. Unfortunately that does not work with xpath(), which raises an exception:

TypeError: empty namespace prefix is not supported in XPath

So you unfortunately have to prefix every node name in your XPath expression with ns: (or whatever you choose to map that namespace to).
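Rather than typing the map by hand, you could derive it from the root element's nsmap and swap the None key for a prefix of your choosing. This is only a sketch: the prefix 'ns' is an arbitrary choice and assumes the document does not already use that prefix for something else.

```python
from lxml import etree

XML = b"""<net xmlns="http://www.arin.net/whoisrws/core/v1"
     xmlns:ns2="http://www.arin.net/whoisrws/rdns/v1">
  <startAddress>79.0.0.0</startAddress>
</net>"""

root = etree.fromstring(XML)

# root.nsmap maps None to the default namespace; XPath needs a real prefix,
# so replace the None key with 'ns' (arbitrary, assumed not to collide).
xpath_ns = {(prefix or 'ns'): uri for prefix, uri in root.nsmap.items()}

starts = root.xpath('/ns:net/ns:startAddress/text()', namespaces=xpath_ns)
print(starts)
```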

For more information on the XPath syntax, see for example the XPath Syntax page in the W3Schools Xpath Tutorial.

To get going with XPath it can also be very helpful to fiddle around with your document in one of the many XPath testers. Also, the Firebug plugin for Firefox, or Google Chrome inspector allow you to show the (or rather, one of many) XPath for the selected element.
