遍历元素的python xml占用大量内存 [英] python xml iterating over elements takes a lot of memory

查看:45
本文介绍了遍历元素的python xml占用大量内存的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些很大的XML文件(每个文件大约100-150 MB).

I have some very big XML files (around ~100-150 MB each).

我XML中的一个元素是 M (用于成员),它是 HH (家庭)的子元素-

One element in my XML is M (for member), which is a child of HH (household) -

即-每个家庭包含一个或多个成员.

i.e. - each household contains one or more members.

我需要做的是让所有满足某些条件的成员都接受(条件可能会发生变化,并且可能同时出现在家庭和成员身上-例如-仅来自高收入家庭的成员(对家庭的约束)),其年龄在18-49岁之间(对成员有限制)),并以相当复杂的功能对其进行进一步处理.

What I need to do is to take all the members that satisfies some conditions (the conditions can change, and can be both on the household and on the members - e.g. - just members from households with high income (constraint on the household), who's age is between 18-49 (constraint on the member)) - and to further process them in a rather complicated function.

这就是我正在做的:

import lxml.etree as ET
all_members=[]
tree=ET.parse(whole_path)
root=tree.getroot()
HH_str='//H' #get all the households
HH=tree.xpath(HH_str)
for H in HH:
'''check if the hh satisfies the condition'''
    if(is_valid_hh(H)):
        M_str='.//M'
        M=H.xpath(M_str)
        for m in M:
            if(is_valid_member(m)):
                all_members.append(m)

for member in all_members:
'''do something complicated'''

问题是这占用了我所有的内存(我有32 GB)!如何更有效地遍历xml元素?

the problem with this is that it takes all my memory (and I have 32 GB)! how can I iterate over xml elements more efficiently?

任何帮助将不胜感激...

any help will be appreciated...

推荐答案

etree 将消耗大量内存(是的,即使使用 iterparse()),而 sax 确实很笨重.但是, pulldom 可以营救!

etree is going to consume a lot of memory (yes, even with iterparse()), and sax is really clunky. However, pulldom to the rescue!

from xml.dom import pulldom
doc = pulldom.parse('large.xml')
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.tagName == 'special': 
        # Node is 'empty' here       
        doc.expandNode(node)
        # Now we got it all
        if is_valid_hh(node):
            ...do things...

它是其中的一个图书馆,似乎没有人不使用它.的文件,例如 https://docs.python.org/3.7/library/xml.dom.pulldom.html

It's one of those libraries no one who did not have to use it seems to know about. Docs at e.g. https://docs.python.org/3.7/library/xml.dom.pulldom.html

这篇关于遍历元素的python xml占用大量内存的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆