Iterate over both text and elements in lxml etree

Question

Suppose I have the following XML document:

<species>
    Mammals: <dog/> <cat/>
    Reptiles: <snake/> <turtle/>
    Birds: <seagull/> <owl/>
</species>

Then I get the species element like this:

import lxml.etree

# `xml` is a string holding the document shown above
doc = lxml.etree.fromstring(xml)
species = doc.xpath('/species')[0]

Now I would like to print the animals grouped under their text labels (Mammals, Reptiles, Birds). How can I do this using the ElementTree API?

Recommended answer

If you enumerate all of the child nodes, you'll see a text node holding the class label, followed by element nodes for the animals in that class:

>>> for node in species.xpath("child::node()"):
...     print(type(node), node)
... 
<class 'lxml.etree._ElementUnicodeResult'> 
    Mammals: 
<class 'lxml.etree._Element'> <Element dog at 0xe0b3c0>
<class 'lxml.etree._ElementUnicodeResult'>  
<class 'lxml.etree._Element'> <Element cat at 0xe0b410>
<class 'lxml.etree._ElementUnicodeResult'> 
    Reptiles: 
<class 'lxml.etree._Element'> <Element snake at 0xe0b460>
<class 'lxml.etree._ElementUnicodeResult'>  
<class 'lxml.etree._Element'> <Element turtle at 0xe0b4b0>
<class 'lxml.etree._ElementUnicodeResult'> 
    Birds: 
<class 'lxml.etree._Element'> <Element seagull at 0xe0b500>
<class 'lxml.etree._ElementUnicodeResult'>  
<class 'lxml.etree._Element'> <Element owl at 0xe0b550>
<class 'lxml.etree._ElementUnicodeResult'> 

So you can build it from there:

my_species = {}
current_class = None
for node in species.xpath("child::node()"):
    # XPath text results are str subclasses; elements are _Element instances.
    if isinstance(node, str):
        # Strip whitespace and the trailing colon to recover the class label.
        text = node.strip(' \n\t:')
        if text:
            current_class = my_species.setdefault(text, [])
    elif isinstance(node, lxml.etree._Element):
        if current_class is not None:
            current_class.append(node.tag)
print(my_species)

Result

{'Mammals': ['dog', 'cat'], 'Reptiles': ['snake', 'turtle'], 'Birds': ['seagull', 'owl']}
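
Since the question asks about the ElementTree API specifically, the same grouping can probably also be done without XPath, by reading each node's .text and .tail attributes. This is an alternative sketch, not the approach used in the answer above. The helper names label_of and group_by_label are made up for illustration, and the fallback import only assumes that the standard library's xml.etree.ElementTree exposes the same .text/.tail behaviour for this document.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same .text/.tail API
    import xml.etree.ElementTree as etree

xml = """<species>
    Mammals: <dog/> <cat/>
    Reptiles: <snake/> <turtle/>
    Birds: <seagull/> <owl/>
</species>"""


def label_of(text):
    """Return the label in a text chunk, or '' if it is only whitespace."""
    return (text or "").strip(" \n\t:")


def group_by_label(parent):
    """Group child tags under the most recent non-blank text label."""
    groups = {}
    current = None

    # parent.text is the text before the first child element.
    label = label_of(parent.text)
    if label:
        current = groups.setdefault(label, [])

    for child in parent:
        if current is not None:
            current.append(child.tag)
        # child.tail is the text between this child and the next node.
        label = label_of(child.tail)
        if label:
            current = groups.setdefault(label, [])
    return groups


species = etree.fromstring(xml)
print(group_by_label(species))
```

Here the text in front of the first child lives in species.text, and every later label is the tail of the element that precedes it, so no type checks on the nodes are needed.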

This is all fragile: small changes in how the text nodes are arranged can break the parsing.
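
One concrete way the loop can go wrong: in lxml, comments are also instances of lxml.etree._Element, and their .tag is the lxml.etree.Comment factory function rather than a string, so a comment placed between two animals would end up in the list. Below is a minimal sketch of a slightly more defensive variant, assuming Python 3 and a made-up document that contains such a comment. It only guards against this one case and is not meant as a complete fix.

```python
import lxml.etree

# Hypothetical variation of the document: a comment sits between two elements.
xml = """<species>
    Mammals: <dog/> <!-- pets --> <cat/>
    Reptiles: <snake/> <turtle/>
</species>"""

species = lxml.etree.fromstring(xml)

grouped = {}
current_class = None
for node in species.xpath("child::node()"):
    if isinstance(node, str):
        # Text node: a non-blank one is treated as a new class label.
        text = node.strip(" \n\t:")
        if text:
            current_class = grouped.setdefault(text, [])
    elif isinstance(node.tag, str):
        # Real element. Comments and processing instructions have a
        # non-string tag, so they fall through without being recorded.
        if current_class is not None:
            current_class.append(node.tag)

print(grouped)
```

Checking that node.tag is a string is the usual way to tell real elements from comments and processing instructions in lxml. Other rearrangements, such as a label without its animals or animals before any label, would still need their own handling.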
