What is the fastest way to parse large XML docs in Python?

Question

I am currently running the following code based on Chapter 12.5 of the Python Cookbook:

from xml.parsers import expat

class Element(object):
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.cdata = ''
        self.children = []
    def addChild(self, element):
        self.children.append(element)
    def getAttribute(self,key):
        return self.attributes.get(key)
    def getData(self):
        return self.cdata
    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        else:
            return list(self.children)

class Xml2Obj(object):
    """Build a tree of Element objects from expat parser callbacks."""
    def __init__(self):
        self.root = None
        self.nodeStack = []
    def StartElement(self, name, attributes):
        # expat already passes str on Python 3, so no encode() is needed
        element = Element(name, attributes)
        if self.nodeStack:
            parent = self.nodeStack[-1]
            parent.addChild(element)
        else:
            self.root = element
        self.nodeStack.append(element)
    def EndElement(self, name):
        self.nodeStack.pop()
    def CharacterData(self, data):
        # skip whitespace-only chunks; expat may deliver text in several pieces
        if data.strip():
            element = self.nodeStack[-1]
            element.cdata += data
    def Parse(self, filename):
        parser = expat.ParserCreate()
        parser.StartElementHandler = self.StartElement
        parser.EndElementHandler = self.EndElement
        parser.CharacterDataHandler = self.CharacterData
        # binary mode lets expat honour the encoding declared in the XML prolog
        with open(filename, "rb") as f:
            parser.Parse(f.read(), True)
        return self.root

I am working with XML documents of about 1 GB in size. Does anyone know a faster way to parse these?

Recommended answer

It looks to me as if you do not need any DOM capabilities in your program. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with the events as they occur. (On Python 3.3 and later, xml.etree.ElementTree uses the C accelerator automatically, and the separate cElementTree module was removed in Python 3.9.)

Note, however, Fredrik's advice on using the cElementTree iterparse function:

To parse large files, you can get rid of elements as soon as you've processed them:

from xml.etree.ElementTree import iterparse

# source is a filename or an open file object
for event, elem in iterparse(source):
    if elem.tag == "record":
        # ... process record elements ...
        elem.clear()
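
As a rough illustration of this pattern, here is a minimal sketch that runs against a small in-memory document instead of a 1 GB file. The names SAMPLE and collect_record_ids are hypothetical, as is the id attribute on each record; none of them come from the original answer.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample data standing in for a large file on disk.
SAMPLE = b"<root><record id='1'>a</record><record id='2'>b</record><other/></root>"

def collect_record_ids(source):
    """Stream through source and return the id attribute of every <record>."""
    ids = []
    # With no events argument, iterparse reports only "end" events,
    # so each element is complete by the time it is handed to the loop.
    for event, elem in iterparse(source):
        if elem.tag == "record":
            ids.append(elem.get("id"))
            elem.clear()  # drop this record's text, attributes and children
    return ids

record_ids = collect_record_ids(io.BytesIO(SAMPLE))
print(record_ids)
```

For a real file, a filename or a file object opened in binary mode would presumably take the place of the io.BytesIO wrapper.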

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element (Python 2 spelling; on Python 3 use next(context))
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()

lxml.iterparse() does not allow this.

The snippet above does not work on Python 3 (3.7, for example), because iterators there have no .next() method. Either call the built-in next(context) instead, or pick up the first element inside the loop:

import xml.etree.ElementTree as ET

# get an iterable
context = ET.iterparse(source, events=("start", "end"))

is_first = True

for event, elem in context:
    # get the root element
    if is_first:
        root = elem
        is_first = False
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()
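
Putting the pieces together, the following is a minimal sketch of the root-clearing approach wrapped in a generator, so callers can loop over records without holding the whole tree in memory. It uses the built-in next(), which works on Python 3. The names iter_records and SAMPLE are hypothetical, and the sketch assumes the records are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, tag="record"):
    """Yield (attributes, text) for each <tag> element while streaming."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is always the "start" of the root element.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Copy what is needed before clearing, since clear() wipes the element.
            yield dict(elem.attrib), elem.text
            root.clear()  # detach already-processed children from the root

# Hypothetical sample data standing in for a large file on disk.
SAMPLE = b"<root><record id='1'>a</record><record id='2'>b</record></root>"
records = list(iter_records(io.BytesIO(SAMPLE)))
print(records)
```

Yielding plain copies of the attributes and text, not the element itself, keeps the caller from holding references to elements that are about to be cleared.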
