lxml and fast_iter eating all the memory

Problem description

I want to parse a 1.6 GB XML file with Python (2.7.2) using lxml (3.2.0) on OS X (10.8.2). Because I had already read about potential issues with memory consumption, I am already using fast_iter, but after the main loop it eats up about 8 GB of RAM, even though it doesn't keep any data from the actual XML file.

from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        # Free the element's contents once it has been processed...
        elem.clear()
        # ...and delete already-processed preceding siblings from the root.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    pass

context = etree.iterparse("sachsen-latest.osm", tag="node", events=("end", ))
fast_iter(context, process_element)

I don't get why there is such massive leakage, because the element and the whole context are deleted in fast_iter(), and at the moment I don't even process the XML data.

Any ideas?

Recommended answer

The problem is with the behavior of etree.iterparse(). You would think it only uses memory for each node element, but it turns out it still keeps all the other elements in memory. Since you don't clear them, memory ends up blowing up later on, especially when parsing an .osm (OpenStreetMap) file and looking only for nodes, but more on that later.
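A minimal sketch of this behavior (the tiny inline document is just a stand-in for a real .osm file): with a tag filter, iterparse still parses every element into the tree, but only yields the matching ones, so nothing inside the loop ever gets a chance to clear the others.

from io import BytesIO
from lxml import etree

xml = b"<osm><node id='1'/><node id='2'/><way id='3'/><relation id='4'/></osm>"

# With tag='node', iterparse only *yields* node elements. The way and
# relation elements are still parsed into the tree, but the loop never
# receives them, so no cleanup code inside the loop can clear them.
# In an .osm file they all come after the last node, so even deleting
# preceding siblings never reaches them.
for event, elem in etree.iterparse(BytesIO(xml), events=("end",), tag="node"):
    print("%s id=%s" % (elem.tag, elem.get("id")))
    elem.clear()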

The solution I found was not to catch only node tags, but to catch all tags:

context = etree.iterparse(open(filename, 'rb'), events=('end',))  # note: no tag= filter

Then clear all the tags, but only process the ones you are interested in:

for event, elem in context:  # the original answer wrapped this in a progress bar
    if elem.tag == 'node':
        pass  # do things here

    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context

Do keep in mind that this may delete other elements you are interested in, so make sure to add more ifs where needed. For example (and this is .osm specific), the tag elements nested inside nodes:

if elem.tag == 'tag':
    continue  # skip: <tag> children are cleared together with their parent
if elem.tag == 'node':
    for tag in elem.iterchildren():
        pass  # do stuff
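Putting the pieces together, the whole loop might look like the sketch below; process_node is a hypothetical callback standing in for your own processing code:

from lxml import etree

def process_node(elem):
    # Hypothetical callback: read the node's attributes and <tag> children here.
    pass

def parse_osm(filename):
    # Parse every closing tag, not just <node>, so ways and relations
    # can also be cleared from memory as soon as they end.
    context = etree.iterparse(open(filename, 'rb'), events=('end',))
    for event, elem in context:
        if elem.tag == 'tag':
            continue  # leave <tag> children for their parent <node> to use
        if elem.tag == 'node':
            process_node(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context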

The reason why memory was blowing up later is pretty interesting: .osm files are organized in such a way that nodes come first, then ways, then relations. So your code does fine with the nodes at the beginning, and memory then fills up as etree goes through the rest of the elements.
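Schematically (attributes and most content elided), an .osm file is laid out like this:

<osm>
  <node id="..." lat="..." lon="...">
    <tag k="..." v="..."/>
  </node>
  <!-- ...all the other nodes... -->
  <way id="...">...</way>
  <!-- ...all the other ways... -->
  <relation id="...">...</relation>
  <!-- ...all the other relations... -->
</osm>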
