Iterating multiple (parent, child) nodes using Python ElementTree

Problem description

The standard implementation of ElementTree for Python (2.6) does not provide pointers to parents from child nodes. Therefore, if parents are needed, it is suggested to loop over parents rather than children.

Suppose my XML is of the form:

<Content>
  <Para>first</Para>
  <Table><Para>second</Para></Table>
  <Para>third</Para>
</Content>

The following finds all "Para" nodes without considering parents:

(1) paras = [p for p in page.getiterator("Para")]

This (adapted from effbot) keeps the parent by looping over every element as a potential parent and then over that element's children, instead of looping over the child nodes directly:

(2) paras = [(c,p) for p in page.getiterator() for c in p]

This makes perfect sense, and can be extended with a conditional to achieve the (supposedly) same result as (1), but with parent info added:

(3) paras = [(c,p) for p in page.getiterator() for c in p if c.tag == "Para"]

The ElementTree documentation suggests that the getiterator() method does a depth-first search. Running it without looking for the parent (1) yields:

first
second
third

However, extracting the text from paras in (3) yields:

first, Content>Para
third, Content>Para
second, Table>Para

This appears to be breadth-first.
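
For readers on a current interpreter, the observation can be reproduced with the sketch below. It assumes Python 3, where `getiterator()` and `cElementTree` are no longer available (both were removed in Python 3.9), so `Element.iter()` and `xml.etree.ElementTree` are used in their place. The variable names `paras_only` and `pairs` are chosen here for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<Content>
  <Para>first</Para>
  <Table><Para>second</Para></Table>
  <Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# (1) Search for Para elements only; iter() walks the tree in document order.
paras_only = [p.text for p in page.iter("Para")]

# (3) Outer loop: every element as a candidate parent (document order).
#     Inner loop: that element's direct children, keeping only Para children.
pairs = [(c.text, p.tag) for p in page.iter() for c in p if c.tag == "Para"]

print(paras_only)  # ['first', 'second', 'third']
print(pairs)       # [('first', 'Content'), ('third', 'Content'), ('second', 'Table')]
```

The second list is ordered by where each parent sits in the document, not by where each child sits, which is why "third" appears before "second".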

This therefore raises two questions.

  1. Is this correct and expected behaviour?
  2. How do you extract (parent, child) tuples when the child must be of a certain type but the parent can be anything, and document order must be maintained? I do not think running two loops and mapping the (parent, child) pairs generated by (3) onto the order generated by (1) is ideal.

Solution

Consider this:

>>> xml = """<Content>
...   <Para>first</Para>
...   <Table><Para>second</Para></Table>
...   <Para>third</Para>
... </Content>"""
>>> import xml.etree.cElementTree as et
>>> page = et.fromstring(xml)
>>> for p in page.getiterator():
...     print "ppp", p.tag, repr(p.text)
...     for c in p:
...         print "ccc", c.tag, repr(c.text), p.tag
...
ppp Content '\n  '
ccc Para 'first' Content
ccc Table None Content
ccc Para 'third' Content
ppp Para 'first'
ppp Table None
ccc Para 'second' Table
ppp Para 'second'
ppp Para 'third'
>>> 

Aside: list comprehensions are magnificent until you want to see exactly what is being iterated over :-)

getiterator is producing the "ppp" elements in the advertised order. However, you are plucking your elements of interest out of the subsidiary "ccc" elements, which are not in your desired order.
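
As an aside that is not part of the original answer: since the children only come out of order because they are grouped by parent, one workaround is to record the child-to-parent relationship first and then walk the "Para" elements themselves in document order. This is the two-pass approach the question calls less than ideal, but it is short. The sketch assumes Python 3 and the same sample document; `parent_of` and `ordered` are names picked here for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<Content>
  <Para>first</Para>
  <Table><Para>second</Para></Table>
  <Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# Pass 1: map every child element to its parent.
parent_of = {child: parent for parent in page.iter() for child in parent}

# Pass 2: visit Para elements in document order and look up each parent.
ordered = [(para.text, parent_of[para].tag) for para in page.iter("Para")]

print(ordered)  # [('first', 'Content'), ('second', 'Table'), ('third', 'Content')]
```

The root element has no entry in `parent_of`, so if the root itself could match the tag being searched for, `parent_of.get(para)` would be the safer lookup.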

One solution is to do your own iteration:

>>> def process(elem, parent):
...     print elem.tag, repr(elem.text), parent.tag if parent is not None else None
...     for child in elem:
...         process(child, elem)
...
>>> process(page, None)
Content '\n  ' None
Para 'first' Content
Table None Content
Para 'second' Table
Para 'third' Content
>>>

Now you can snarf "Para" elements each with a reference to its parent (if any) as they stream past.
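
A minimal sketch of that idea in Python 3, assuming the same sample document: the recursive walk is turned into a generator so the caller can filter for "Para" as the pairs stream past. The names `walk_with_parent` and `para_pairs` are hypothetical and are not from the original answer.

```python
import xml.etree.ElementTree as ET

def walk_with_parent(elem, parent=None):
    """Yield (element, parent) pairs in document order, starting with elem."""
    yield elem, parent
    for child in elem:
        # Recurse into each child, passing the current element as its parent.
        yield from walk_with_parent(child, elem)

page = ET.fromstring(
    "<Content><Para>first</Para>"
    "<Table><Para>second</Para></Table>"
    "<Para>third</Para></Content>"
)

# Keep only Para elements; the root's parent is None, so test with "is not None".
para_pairs = [
    (e.text, p.tag if p is not None else None)
    for e, p in walk_with_parent(page)
    if e.tag == "Para"
]

print(para_pairs)  # [('first', 'Content'), ('second', 'Table'), ('third', 'Content')]
```

The explicit `is not None` test matters because an element with no children evaluates as false in a boolean context, so a plain `if p` check could misfire.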

This can be wrapped up nicely in a generator gadget:

>>> def iterate_with_parent(elem):
...     stack = []
...     while 1:
...         for child in reversed(elem):
...             stack.append((child, elem))
...         if not stack: return
...         elem, parent = stack.pop()
...         yield elem, parent
...
>>>
>>> showtag = lambda e: e.tag if e is not None else None
>>> showtext = lambda e: repr((e.text or '').rstrip())
>>> for e, p in iterate_with_parent(page):
...     print e.tag, showtext(e), showtag(p)
...
Para 'first' Content
Table '' Content
Para 'second' Table
Para 'third' Content
>>>
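
For reference, here is one way the same stack-based generator might look in Python 3. This is a sketch rather than a line-for-line port: it uses `xml.etree.ElementTree` in place of `cElementTree`, and the loop is restructured around `while stack`. Like the version above, it yields only the descendants of the starting element, not the element itself. The variable `result` is named here for illustration.

```python
import xml.etree.ElementTree as ET

def iterate_with_parent(root):
    """Yield (element, parent) for every descendant of root, in document order."""
    # Seed the stack with root's children, reversed so the first child pops first.
    stack = [(child, root) for child in reversed(list(root))]
    while stack:
        elem, parent = stack.pop()
        yield elem, parent
        # Push this element's children (reversed) so they are visited next, in order.
        stack.extend((child, elem) for child in reversed(list(elem)))

page = ET.fromstring(
    "<Content><Para>first</Para>"
    "<Table><Para>second</Para></Table>"
    "<Para>third</Para></Content>"
)

result = [(e.tag, (e.text or "").rstrip(), p.tag) for e, p in iterate_with_parent(page)]
for row in result:
    print(row)
# ('Para', 'first', 'Content')
# ('Table', '', 'Content')
# ('Para', 'second', 'Table')
# ('Para', 'third', 'Content')
```

An explicit stack avoids deep recursion on heavily nested documents; filtering for a tag is then a matter of adding `if e.tag == "Para"` to the consuming loop.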
