Why is lxml.etree.iterparse() eating up all my memory?


Problem description

This eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags but that didn't make a difference.

What am I doing wrong / how can I process this large file with iterparse()?

import lxml.etree

for event, schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print("why does this consume all my memory?")

I can easily cut it up and process it in smaller chunks but that's uglier than I'd like.

Recommended answer

As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.

In order to free some memory as you parse, use Liza Daly's fast_iter:

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

You can use it like this:

def process_element(elem):
    print("why does this consume all my memory?")

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, process_element)
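As a quick sanity check that fast_iter still visits every element while pruning the tree behind itself, here is a self-contained sketch. The fast_iter body is repeated verbatim so the snippet runs on its own; the five-element document and its `n` attribute are made up purely to verify ordering.

```python
import io
import lxml.etree

def fast_iter(context, func, *args, **kwargs):
    # Same logic as the fast_iter above, repeated so this snippet is standalone.
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

# Hypothetical small document; the n attribute exists only to check ordering.
xml = b"<root>" + b"".join(b'<schedule n="%d"/>' % i for i in range(5)) + b"</root>"

seen = []

def process_element(elem):
    seen.append(elem.get('n'))

context = lxml.etree.iterparse(io.BytesIO(xml), tag='schedule', events=('end',))
fast_iter(context, process_element)
print(seen)  # ['0', '1', '2', '3', '4'] -- every element visited, in order
```

Even though each `<schedule>` and its preceding siblings are deleted as soon as they are processed, all five elements reach process_element in document order.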

I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.

The fast_iter presented above is a slightly modified version of the one shown in the article. This one is more aggressive about deleting previous ancestors, thus saving more memory. Here you'll find a script which demonstrates the difference.

