Loading huge XML files and dealing with MemoryError


Question

I have a very large XML file (20 GB, to be exact, and yes, I need all of it). When I attempt to load the file, I receive this error:

Python(23358) malloc: *** mmap(size=140736680968192) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "file.py", line 5, in <module>
    code = xml.read()
MemoryError

This is the code I currently use to read the XML file:

from bs4 import BeautifulSoup
xml = open('pages_full.xml', 'r')
code = xml.read()
xml.close()
soup = BeautifulSoup(code)

Now, how would I go about eliminating this error so that I can continue working on the script? I could try splitting the file into separate files, but since I don't know how that would affect BeautifulSoup or the XML data, I'd rather not do this.

(The XML data is a database dump from a wiki I volunteer on. I am using it to import data from different time periods, using the direct information from many pages.)

Recommended answer

Do not use BeautifulSoup to try to parse such a large XML file. Use the ElementTree API instead. Specifically, use the iterparse() function to parse your file as a stream, handle information as you are notified of elements, then delete the elements again:

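from xml.etree import ElementTree as ET

filename = 'pages_full.xml'  # the file from the question

# iterparse() yields (event, element) pairs; by default only "end" events
parser = ET.iterparse(filename)

for event, element in parser:
    # element is a whole element
    if element.tag == 'yourelement':
        # do something with this element
        # then clean up
        element.clear()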

By using an event-driven approach, you never need to hold the whole XML document in memory; you extract only what you need and discard the rest.
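One detail worth illustrating: element.clear() empties an element, but the emptied element stays attached to the root, so the root's child list can still grow over a very long file. A common way around this is to request "start" events as well, keep a reference to the root, and clear it after each record. The following is a minimal sketch of that idea, not a definitive implementation. The function name iter_titles and the tag names page and title are assumptions about how a wiki dump might be laid out, and the sketch assumes the records are direct children of the root.

```python
import io
from xml.etree import ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(source, record_tag="page", field_tag="title"):
    """Yield the text of `field_tag` for each `record_tag` element.

    `record_tag` and `field_tag` are hypothetical names; adjust them
    to whatever the real dump uses.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, element in context:
        if event == "end" and local_name(element.tag) == record_tag:
            title = next(
                (child.text for child in element
                 if local_name(child.tag) == field_tag),
                None,
            )
            yield title
            # Drop the finished record (and any earlier siblings) from
            # the root so memory use stays roughly constant.
            root.clear()


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real 20 GB file.
    sample = io.BytesIO(
        b"<wiki><page><title>A</title></page>"
        b"<page><title>B</title></page></wiki>"
    )
    for title in iter_titles(sample):
        print(title)
```

For the real file, pass the filename (for example 'pages_full.xml') instead of the in-memory sample. Because the function is a generator, each record is processed and discarded before the next one is read.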

See the iterparse() tutorial and documentation.

Alternatively, you can use the lxml library; it offers the same API in a faster and more feature-rich package.
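Since the two libraries share the iterparse() interface, switching can be as small as changing the import. The sketch below assumes lxml may or may not be installed and falls back to the standard library; the helper count_elements and its arguments are hypothetical names used only for illustration.

```python
try:
    from lxml import etree as ET  # third-party; assumed to be installed
except ImportError:
    from xml.etree import ElementTree as ET  # standard-library fallback


def count_elements(path, tag):
    """Count elements named `tag` while streaming through the file at `path`."""
    count = 0
    # Only "end" events are requested, so each element is complete
    # by the time it is seen here.
    for event, element in ET.iterparse(path, events=("end",)):
        if element.tag == tag:
            count += 1
            element.clear()  # free the element's children and text
    return count
```

A call such as count_elements('pages_full.xml', 'yourelement') would then run unchanged with either library.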
