遍历元素的python xml占用大量内存 [英] python xml iterating over elements takes a lot of memory

查看：45 发布时间：2021/5/10 18:45:08 python xml list xpath generator

本文介绍了遍历元素的python xml占用大量内存的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一些很大的XML文件(每个文件大约100-150 MB).

I have some very big XML files (around ~100-150 MB each).

我XML中的一个元素是 M (用于成员)，它是 HH (家庭)的子元素-

One element in my XML is M (for member), which is a child of HH (household) -

即-每个家庭包含一个或多个成员.

i.e. - each household contains one or more members.

我需要做的是让所有满足某些条件的成员都接受(条件可能会发生变化，并且可能同时出现在家庭和成员身上-例如-仅来自高收入家庭的成员(对家庭的约束))，其年龄在18-49岁之间(对成员有限制))，并以相当复杂的功能对其进行进一步处理.

What I need to do is to take all the members that satisfies some conditions (the conditions can change, and can be both on the household and on the members - e.g. - just members from households with high income (constraint on the household), who's age is between 18-49 (constraint on the member)) - and to further process them in a rather complicated function.

这就是我正在做的:

import lxml.etree as ET
all_members=[]
tree=ET.parse(whole_path)
root=tree.getroot()
HH_str='//H' #get all the households
HH=tree.xpath(HH_str)
for H in HH:
'''check if the hh satisfies the condition'''
    if(is_valid_hh(H)):
        M_str='.//M'
        M=H.xpath(M_str)
        for m in M:
            if(is_valid_member(m)):
                all_members.append(m)

for member in all_members:
'''do something complicated'''

问题是这占用了我所有的内存(我有32 GB)！如何更有效地遍历xml元素?

the problem with this is that it takes all my memory (and I have 32 GB)! how can I iterate over xml elements more efficiently?

任何帮助将不胜感激...

any help will be appreciated...

推荐答案

etree 将消耗大量内存(是的，即使使用 iterparse())，而 sax 确实很笨重.但是， pulldom 可以营救！

etree is going to consume a lot of memory (yes, even with iterparse()), and sax is really clunky. However, pulldom to the rescue!

from xml.dom import pulldom
doc = pulldom.parse('large.xml')
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.tagName == 'special': 
        # Node is 'empty' here       
        doc.expandNode(node)
        # Now we got it all
        if is_valid_hh(node):
            ...do things...

它是其中的一个图书馆，似乎没有人不使用它.的文件，例如 https://docs.python.org/3.7/library/xml.dom.pulldom.html

It's one of those libraries no one who did not have to use it seems to know about. Docs at e.g. https://docs.python.org/3.7/library/xml.dom.pulldom.html

这篇关于遍历元素的python xml占用大量内存的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

遍历元素的python xml占用大量内存 [英] python xml iterating over elements takes a lot of memory

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

遍历元素的python xml占用大量内存 [英] python xml iterating over elements takes a lot of memory

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭