Efficient reading of 800 GB XML file in Python 2.7
Question
I am reading an 800 GB XML file in Python 2.7 and parsing it with an etree iterative parser.
Currently, I am just using open('foo.txt') with no buffering argument. I am a little confused whether this is the approach I should take, or whether I should use a buffering argument or something from io such as io.BufferedReader, io.open, or io.TextIOBase.
A pointer in the right direction would be much appreciated.
Answer
The standard open() function already returns a buffered file by default (if available on your platform). For file objects the stream is usually fully buffered.
Usually here means that Python leaves this to the C stdlib implementation; it uses an fopen() call (wfopen() on Windows, to support UTF-16 filenames), which means the default buffering for the file is chosen; on Linux I believe that would be 8 KB. For a pure-read operation like XML parsing, this type of buffering is exactly what you want.
The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16 KB).
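With a file this size the other half of the battle is keeping the parsed tree from accumulating in memory while iterparse works through those chunks. A minimal sketch of that pattern (not part of the original answer; the record tag is a hypothetical placeholder for whatever elements the file actually contains):

import xml.etree.cElementTree as etree  # the C-accelerated parser in Python 2.7

def count_records(path):
    # Grab the root element from the first 'start' event so that
    # finished subtrees can be cleared and memory use stays bounded.
    context = etree.iterparse(path, events=('start', 'end'))
    event, root = next(context)
    count = 0
    for event, elem in context:
        if event == 'end' and elem.tag == 'record':  # hypothetical tag
            count += 1       # ...process the completed element here...
            root.clear()     # drop processed children of the root
    return count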
If you want to control the buffer size, use the buffering keyword argument:
open('foo.xml', buffering=(2<<16) + 8) # buffer enough for 8 full parser reads
This overrides the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article, increasing the read buffer should help, and using a size of at least 4 times the expected read block size plus 8 bytes will improve read performance. In the above example I've set it to 8 times the ElementTree read size.
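To make the arithmetic explicit (a small sketch; the 16384-byte figure is the iterparse read size mentioned above):

PARSER_READ = 16384                  # bytes per internal iterparse read
BUFFER_SIZE = 8 * PARSER_READ + 8    # eight full parser reads, plus 8 bytes
assert BUFFER_SIZE == (2 << 16) + 8  # 131080, the value used above

f = open('foo.xml', buffering=BUFFER_SIZE)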
The io.open() function represents the new Python 3 I/O structure, where I/O has been split up into a new hierarchy of class types to give you more flexibility. The price is more indirection: the data has to travel through more layers, and the Python C code does more work itself instead of leaving it to the OS.
You could try and see whether io.open('foo.xml', 'rb', buffering=2<<16) performs any better. Opening in rb mode gives you an io.BufferedReader instance.
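A sketch of that variant, assuming (as holds in Python 2.7) that iterparse accepts an already-open binary file object as well as a filename:

import io
import xml.etree.cElementTree as etree

# 'rb' plus a buffering argument yields an io.BufferedReader
# with a 128 KB buffer; iterparse reads from it in 16 KB chunks.
with io.open('foo.xml', 'rb', buffering=2 << 16) as f:
    for event, elem in etree.iterparse(f):
        elem.clear()  # process elem before clearing in real code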
You do not want to use io.TextIOWrapper; the underlying expat parser wants raw data, as it will decode your XML file's encoding itself. The wrapper would only add extra overhead; it is the type you get if you open in r (text mode) instead.
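To make the distinction concrete, a quick check (both assertions hold for an ordinary file in Python 2.7):

import io

with io.open('foo.xml', 'rb') as f:   # binary: raw bytes, what expat wants
    assert isinstance(f, io.BufferedReader)

with io.open('foo.xml', 'r') as f:    # text: decodes to unicode first
    assert isinstance(f, io.TextIOWrapper)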
Using io.open() may give you more flexibility and a richer API, but the underlying C file object is opened using open() instead of fopen(), and all buffering is handled by the Python io.BufferedIOBase implementation.
Your problem will be processing this beast, not the file reads, I think. And the disk cache will be pretty much shot anyway when reading an 800 GB file.