Why is lxml.etree.iterparse() eating up all my memory?


Question

The code below eventually consumes all my available memory and then the process is killed. I've tried changing the tag from schedule to 'smaller' tags but that didn't make a difference.

What am I doing wrong / how can I process this large file with iterparse()?

import lxml.etree

for schedule in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    print "why does this consume all my memory?"

I can easily cut it up and process it in smaller chunks but that's uglier than I'd like.

Answer

As iterparse iterates over the entire file, a tree is built and no elements are freed. The advantage of doing this is that the elements remember who their parent is, and you can form XPaths that refer to ancestor elements. The disadvantage is that it can consume a lot of memory.
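For example, because nothing is freed, you can still reach an element's parent from inside a plain iterparse loop. A minimal sketch (using the tag and file name from the question):

import lxml.etree

# The default iterparse keeps the whole tree in memory, so ancestor lookups work.
for event, elem in lxml.etree.iterparse('really-big-file.xml', tag='schedule'):
    parent = elem.getparent()  # still available because no elements were freed
    print(parent.tag if parent is not None else '(root)')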

In order to free some memory as you parse, use Liza Daly's fast_iter:

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

Which you could then use like this:

def process_element(elem):
    print("why does this consume all my memory?")

context = lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',))
fast_iter(context, process_element)
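If you prefer not to carry the helper function around, the same idea can be written inline. This is only a sketch, not part of the original answer; it assumes a reasonably recent lxml (clear() has accepted a keep_tail argument since around lxml 4.4) and, unlike fast_iter above, it only trims preceding siblings of the matched element itself:

import lxml.etree

for event, elem in lxml.etree.iterparse('really-big-file.xml', tag='schedule', events=('end',)):
    process_element(elem)
    # Drop the element's children, attributes and text so they can be garbage-collected;
    # keep_tail preserves any tail text that still belongs to the document.
    elem.clear(keep_tail=True)
    # Delete already-processed preceding siblings that the root still references.
    while elem.getprevious() is not None:
        del elem.getparent()[0]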

I highly recommend the article on which the above fast_iter is based; it should be especially interesting to you if you are dealing with large XML files.

The fast_iter presented above is a slightly modified version of the one shown in the article. It is more aggressive about deleting previous ancestors and thus saves more memory. Here you'll find a script which demonstrates the difference.
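For comparison, the article's version only clears the matched element and removes its already-processed preceding siblings; it does not walk up through the ancestors. Roughly (reconstructed from the article, so treat the exact code as an approximation):

def fast_iter(context, func):
    # Article version: free the element itself, then drop its preceding siblings.
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

The extra ancestor-or-self loop in the modified version also prunes siblings at every level above the matched tag, which matters when the interesting elements are deeply nested.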

