Why is lxml.etree.iterparse() eating up all my memory?
Question
This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags, but that didn't make a difference.
What am I doing wrong / how can I process this large file with iterparse()?
import lxml.etree

for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print("why does this consume all my memory?")
I can easily cut it up and process it in smaller chunks, but that's uglier than I'd like.
Answer
As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.
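To see this tree-retention behavior concretely, here is a minimal sketch. It uses a small in-memory document in place of a real file, and the element names are illustrative; it shows that every parsed element stays attached to the root:

```python
import io
import lxml.etree

# A tiny stand-in for 'really-big-file.xml' (illustrative content).
xml = b"<root>" + b"".join(
    b"<schedule id='%d'/>" % i for i in range(3)
) + b"</root>"

root = None
for event, elem in lxml.etree.iterparse(io.BytesIO(xml), tag='schedule'):
    # Each parsed element still knows its parent; nothing is freed.
    root = elem.getparent()

# After parsing, all <schedule> elements remain attached to the root,
# which is why memory grows with the size of the file.
print(len(root))
```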
In order to free some memory as you parse, use Liza Daly's fast_iter:
def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context
You can use it like this:
import lxml.etree

def process_element(elem):
    print("why does this consume all my memory?")

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, process_element)
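As a self-contained sanity check, the same pattern can be exercised end to end. This sketch restates the fast_iter above so it runs on its own, and substitutes a small in-memory document and an illustrative callback for the real file:

```python
import io
import lxml.etree

def fast_iter(context, func, *args, **kwargs):
    # Same clearing strategy as the fast_iter above: process each
    # element, clear it, then drop its already-processed siblings.
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

# Illustrative stand-in for 'really-big-file.xml'.
xml = b"<root>" + b"".join(
    b"<schedule id='%d'/>" % i for i in range(1000)
) + b"</root>"

seen = []
context = lxml.etree.iterparse(io.BytesIO(xml), tag='schedule', events=('end',))
fast_iter(context, lambda elem: seen.append(elem.get('id')))
print(len(seen))
```

Every element is still visited exactly once; the difference is that processed elements no longer accumulate in the tree.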
I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.

The fast_iter presented above is a slightly modified version of the one shown in the article. This one is more aggressive about deleting previous ancestors, thus saving more memory. Here you'll find a script which demonstrates the difference.