Trouble parsing a very large XML file with Python

Question

I have a large xml file (about 84MB) which is in this form:

<books>
    <book>...</book>
    ....
    <book>...</book>
</books>

My goal is to extract every single book and get its properties. I tried to parse it (as I did with other xml files) as follows:

from xml.dom.minidom import parse, parseString

fd = "myfile.xml"
parser = parse(fd)
## other python code here

but the code seems to fail at the parse call. Why is this happening, and how can I solve it?

I should point out that the file may contain Greek, Spanish, and Arabic characters.

This is the output I got in IPython:

In [2]: fd = "myfile.xml"

In [3]: parser = parse(fd)
Killed

I would like to point out that the computer freezes during the execution, so this may be related to memory consumption as stated below.

Answer

I would strongly recommend using a SAX parser here. I wouldn't recommend using minidom on any XML document larger than a few megabytes; I've seen it use about 400MB of RAM reading in an XML document that was about 10MB in size. I suspect the problems you are having are being caused by minidom requesting too much memory.

Python comes with an XML SAX parser. To use it, do something like the following.

from xml.sax.handler import ContentHandler  # note: "handler", not "handlers"
from xml.sax import parse

class MyContentHandler(ContentHandler):
    # override various ContentHandler methods as needed...
    pass


handler = MyContentHandler()
parse("mydata.xml", handler)

Your ContentHandler subclass will override various methods in ContentHandler (such as startElement, startElementNS, endElement, endElementNS, or characters). These handle events generated by the SAX parser as it reads your XML document in.

SAX is a more "low-level" way to handle XML than DOM; in addition to pulling the relevant data out of the document, your ContentHandler will need to keep track of which elements it is currently inside. On the upside, because SAX parsers don't keep the whole document in memory, they can handle XML documents of potentially any size, including ones much larger than yours.
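To make the skeleton above concrete, here is a minimal sketch of such a handler that collects one property per book. The `<title>` child element and the sample data are assumptions for illustration (the original file's schema isn't shown); adapt the element names to your actual data:

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

# Tiny stand-in for the real 84MB file; on the real file you would
# call xml.sax.parse("myfile.xml", handler) instead of parseString.
SAMPLE = b"""<books>
  <book><title>First</title></book>
  <book><title>Second</title></book>
</books>"""

class BookHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self._in_title = False   # are we currently inside a <title>?
        self._chunks = []        # text chunks of the current title
        self.titles = []         # one entry per <book>

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # characters() can fire several times for one text node,
        # so accumulate chunks rather than assuming a single call.
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self.titles.append("".join(self._chunks))

handler = BookHandler()
parseString(SAMPLE, handler)
print(handler.titles)  # -> ['First', 'Second']
```

Note how the handler itself must track state (`_in_title`) that a DOM would give you for free; this is the bookkeeping referred to above, and it is the price of the constant memory footprint.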

I haven't tried using other DOM parsers such as lxml on XML documents of this size, but I suspect that lxml will still take considerable time and use a considerable amount of memory to parse your XML document. That could slow down your development if you have to wait for it to read an 84MB XML document every time you run your code.

Finally, I don't believe the Greek, Spanish and Arabic characters you mention will cause a problem.
