Python sax to lxml for 80+GB XML


Problem description

How would you read an XML file using sax and convert it to an lxml etree.iterparse element?

To give an overview of the problem: I have built an XML ingestion tool using lxml for an XML feed that ranges from 25-500MB and needs to be ingested every two days, but I also need to perform a one-time ingestion of a file that is 60-100GB.

I chose to use lxml based on specifications detailing that no single node would exceed 4-8GB in size, which I thought would allow the node to be read into memory and cleared when finished.

An overview of the code is below:

elements = etree.iterparse(
    self._source, events=('end',)
)
for event, element in elements:
    if element.tag == 'Artist-Types':
        self.artist_types(element)

def artist_types(self, element):
    """
    Imports artist types

    :param etree.Element element: the 'Artist-Types' element
    :returns boolean:
    """
    self._log.info("Importing Artist types")
    count = 0
    for child in element:
        failed = False
        fields = self._getElementFields(child, (
            ('id', 'Id'),
            ('type_code', 'Type-Code'),
            ('created_date', 'Created-Date')
        ))
        if self._type is IMPORT_INC and has_artist_type(fields['id']):
            if update_artist_type(fields['id'], fields['type_code']):
                count += 1
            else:
                failed = True
        else:
            if create_artist_type(fields['type_code'],
                                  fields['created_date'], fields['id']):
                count += 1
            else:
                failed = True
        if failed:
            self._log.error("Failed to import artist type %s %s" %
                            (fields['id'], fields['type_code']))
    self._log.info("Imported %d Artist Types Records" % count)
    self._artist_type_count = count
    self._cleanup(element)
    del element
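
For context, the _getElementFields helper is not shown in the excerpt. A plausible sketch, assuming it maps field keys to the text of child elements looked up by tag name (this is a guess at the original helper, not its actual code):

def _getElementFields(self, element, mapping):
    # hypothetical helper: for each (field_key, tag_name) pair, pull the
    # text of the matching child element (None when the tag is absent)
    fields = {}
    for field_key, tag_name in mapping:
        child = element.find(tag_name)
        fields[field_key] = child.text if child is not None else None
    return fields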

Let me know if I can add any clarification.

Answer

iterparse is an iterative parser. It will emit Element objects and events and incrementally build the entire Element tree as it parses, so eventually it will have the whole tree in memory.

However, it is easy to get bounded memory behavior: delete elements you don't need anymore as you parse them.

The typical "giant xml" workload is a single root element with a large number of child elements which represent records. I assume this is the kind of XML structure you are working with?

Usually it is enough to use clear() to empty out the element you are processing. Your memory usage will grow a little, but not by much. If you have a really huge file, then even the empty Element objects will consume too much memory, and in that case you must also delete previously-seen Element objects. Note that you cannot safely delete the current element. The lxml.etree.iterparse documentation describes this technique.
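
The core of that cleanup idiom is only a few lines (the complete example below shows it in context):

# after a record element has been processed:
elem.clear()                 # drop the element's children, text and attributes
while elem.getprevious() is not None:
    del elem.getparent()[0]  # delete earlier siblings that were already handled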

In this case, you will process a record every time a &lt;/record&gt; is found, then you will delete all previous record elements.

Below is an example using an infinitely-long XML document. It will print the process's memory usage as it parses. Note that the memory usage is stable and does not continue growing.

from lxml import etree
import resource

class InfiniteXML(object):
    """File-like object that produces an endless stream of <record> elements."""
    def __init__(self):
        self._root = True
    def read(self, size=None):
        # lxml reads from file-like objects in binary mode, so return bytes
        if self._root:
            self._root = False
            return b"<?xml version='1.0' encoding='US-ASCII'?><records>\n"
        else:
            return b'<record>\n\t<ancestor attribute="value">text value</ancestor>\n</record>\n'

def parse(fp):
    context = etree.iterparse(fp, events=('end',))
    for action, elem in context:
        if elem.tag == 'record':
            # processing goes here
            pass

        # memory usage (peak resident set size so far)
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

        # cleanup:
        # first, empty the children from the current element.
        # This is not absolutely necessary if you are also deleting siblings,
        # but it will allow you to free memory earlier.
        elem.clear()
        # second, delete previous siblings (records)
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        # make sure you have no references to Element objects outside the loop

parse(InfiniteXML())  # runs forever; interrupt with Ctrl-C
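
Adapted to the feed from the question, the same pattern might look roughly like the sketch below. The record tag name 'Artist-Type' and the file path are assumptions, not taken from the original feed:

from lxml import etree

def ingest_artist_types(path):
    # stream-parse the huge file; only fire 'end' events for the record tag
    # ('Artist-Type' is an assumed tag name for the individual records)
    context = etree.iterparse(path, events=('end',), tag='Artist-Type')
    count = 0
    for _, elem in context:
        # process one record here, e.g. read its child elements by tag name
        count += 1
        # free the finished record and every record parsed before it
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
    return count

# ingest_artist_types('/path/to/feed.xml')  # hypothetical path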

