What is the fastest way to parse large XML docs in Python?


Problem Description


I am currently running the following code based on Chapter 12.5 of the Python Cookbook:

from xml.parsers import expat

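# A lightweight node object: stores the tag name, attributes, character data and child elements.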
class Element(object):
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.cdata = ''
        self.children = []
    def addChild(self, element):
        self.children.append(element)
    def getAttribute(self,key):
        return self.attributes.get(key)
    def getData(self):
        return self.cdata
    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        else:
            return list(self.children)

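# Builds a tree of Element objects by feeding expat parser events through a node stack.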
class Xml2Obj(object):
    def __init__(self):
        self.root = None
        self.nodeStack = []
    def StartElement(self, name, attributes):
        element = Element(name.encode(), attributes)
        if self.nodeStack:
            parent = self.nodeStack[-1]
            parent.addChild(element)
        else:
            self.root = element
        self.nodeStack.append(element)
    def EndElement(self, name):
        self.nodeStack.pop()
    def CharacterData(self,data):
        if data.strip():
            data = data.encode()
            element = self.nodeStack[-1]
            element.cdata += data
    def Parse(self, filename):
        Parser = expat.ParserCreate()
        Parser.StartElementHandler = self.StartElement
        Parser.EndElementHandler = self.EndElement
        Parser.CharacterDataHandler = self.CharacterData
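        # Note: this reads the entire document into memory before handing it to expat.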
        ParserStatus = Parser.Parse(open(filename).read(),1)
        return self.root

I am working with XML documents of about 1 GB in size. Does anyone know a faster way to parse these?

Solution

It looks to me as if your program does not need any DOM capabilities. I would second the use of the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with the events as they occur. (On Python 3.3 and later, cElementTree has been merged into xml.etree.ElementTree, which uses the C accelerator automatically.)

Note, however, Fredrik's advice on using the cElementTree iterparse function:

to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = context.next()

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

Note that lxml.iterparse() does not allow this.

The previous approach does not work on Python 3.7; consider the following way to get the first element.

import xml.etree.ElementTree as ET

# Get an iterable.
context = ET.iterparse(source, events=("start", "end"))
    
for index, (event, elem) in enumerate(context):
    # Get the root element.
    if index == 0:
        root = elem
    if event == "end" and elem.tag == "record":
        # ... process record elements ...
        root.clear()
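
For reference, here is a minimal, self-contained sketch of the same streaming pattern. The sample file, the count_records helper and the "record" tag are placeholders for illustration only; substitute your real document and processing logic:

import xml.etree.ElementTree as ET

def count_records(source, tag="record"):
    # Stream through the document; clear processed elements so memory stays bounded.
    context = ET.iterparse(source, events=("start", "end"))
    root = None
    count = 0
    for event, elem in context:
        if root is None:
            # The very first event is the "start" of the root element.
            root = elem
        if event == "end" and elem.tag == tag:
            count += 1        # ... real processing of the record element would go here ...
            root.clear()      # drop children that have already been processed
    return count

if __name__ == "__main__":
    # Hypothetical test file; replace "sample.xml" with the real 1 GB document.
    with open("sample.xml", "w") as f:
        f.write("<records>" + "<record>x</record>" * 1000 + "</records>")
    print(count_records("sample.xml"))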
