Efficient way to iterate through XML elements
Question
I have an XML document like this:
<a>
    <b>hello</b>
    <b>world</b>
</a>
<x>
    <y></y>
</x>
<a>
    <b>first</b>
    <b>second</b>
    <b>third</b>
</a>
I need to iterate through all <a> and <b> tags, but I don't know how many of them are in the document. So I use xpath to handle that:
from lxml import etree

doc = etree.fromstring(xml)
# Find every <a> element, then query its <b> children one <a> at a time
atags = doc.xpath('//a')
for a in atags:
    btags = a.xpath('b')
    for b in btags:
        print(b)
It works, but my files are quite large, and cProfile shows that the xpath calls are very expensive.
Is there a more efficient way to iterate through an unknown number of XML elements?
Recommended answer
XPath should be fast. You can reduce the number of XPath calls to one:
doc = etree.fromstring(xml)
# One query selects every <b> that is a direct child of an <a>
btags = doc.xpath('//a/b')
for b in btags:
    print(b.text)
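As a self-contained illustration of the single-call approach, here is a minimal sketch. It assumes the sample data is wrapped in a hypothetical <root> element, since the snippet in the question has several top-level elements and an XML document needs exactly one. The variable name texts is likewise only for illustration.

```python
from lxml import etree

# Assumption: the sample is wrapped in a hypothetical <root> element,
# because an XML document must have exactly one top-level element.
xml = b"""
<root>
  <a>
    <b>hello</b>
    <b>world</b>
  </a>
  <x>
    <y></y>
  </x>
  <a>
    <b>first</b>
    <b>second</b>
    <b>third</b>
  </a>
</root>
"""

doc = etree.fromstring(xml)

# A single XPath call returns every <b> whose parent is an <a>,
# in document order.
texts = [b.text for b in doc.xpath('//a/b')]
print(texts)
```

The whole tree is still built in memory here; only the number of XPath evaluations changes.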
If that is not fast enough, you could try Liza Daly's fast_iter. It has the advantage of not requiring that the entire XML first be parsed with etree.fromstring, and parent nodes are discarded after their children have been visited. Both of these things help reduce the memory requirements. Below is a modified version of fast_iter that is more aggressive about removing other elements that are no longer needed.
import io

from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.

    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elt):
    print(elt.text)


# xml is expected to hold the document as bytes
context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)
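For comparison, the simpler clear-then-drop-previous-siblings idea can be sketched as a generator instead of a callback-driven function. This is only an illustration under stated assumptions: the name iter_b_texts and the <root> wrapper around the sample data are hypothetical, and unlike the version above it only removes earlier siblings of the matched element, not those of its ancestors.

```python
import io

from lxml import etree

# Assumption: the sample is wrapped in a hypothetical <root> element so that
# it forms a single well-formed document.
xml = (b"<root>"
       b"<a><b>hello</b><b>world</b></a>"
       b"<x><y></y></x>"
       b"<a><b>first</b><b>second</b><b>third</b></a>"
       b"</root>")


def iter_b_texts(source):
    """Yield the text of each <b> element while parsing incrementally."""
    for event, elem in etree.iterparse(source, events=('end',), tag='b'):
        # Read what is needed before the element is emptied.
        yield elem.text
        # Free the element's own content.
        elem.clear()
        # Drop already-processed siblings that precede this element.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


texts = list(iter_b_texts(io.BytesIO(xml)))
print(texts)
```

A generator lets the caller use an ordinary for loop, at the cost of having to read everything it needs from an element before asking for the next one.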
Liza Daly's article on parsing large XML files may also prove useful reading. According to the article, lxml with fast_iter can be faster than cElementTree's iterparse (see Table 1).