Efficient reading of 800 GB XML file in Python 2.7
Question
I am reading an 800 GB XML file in Python 2.7 and parsing it with an etree iterative parser.
Currently, I am just using open('foo.txt') with no buffering argument. I am a little confused whether this is the approach I should take, or whether I should use a buffering argument or something from io, such as io.BufferedReader, io.open, or io.TextIOBase.
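For reference, a minimal sketch of the setup described above; cElementTree is an assumption (one common iterative parser in 2.7), and the per-element handling is elided:

    import xml.etree.cElementTree as etree

    # Current approach: default stdio buffering, iterative parsing.
    source = open('foo.txt')  # no buffering argument
    for event, elem in etree.iterparse(source):
        pass  # each parsed element arrives here
    source.close()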
A pointer in the right direction would be much appreciated.
Answer
The standard open() function already, by default, returns a buffered file (if available on your platform). For file objects, that is usually fully buffered.
Usually here means that Python leaves this to the C stdlib implementation; it uses an fopen() call (wfopen() on Windows to support UTF-16 filenames), which means that the default buffering for a file is chosen; on Linux, I believe that would be 8 kb. For a pure read operation like XML parsing, this type of buffering is exactly what you want.
The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16 kb).
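You can observe this read pattern with a thin pass-through wrapper around the file object (an illustrative sketch; LoggingFile and foo.xml are made up for the demo):

    import xml.etree.cElementTree as etree

    class LoggingFile(object):
        # Pass-through wrapper that reports each read() size the parser asks for.
        def __init__(self, fileobj):
            self._fileobj = fileobj

        def read(self, size=-1):
            print 'read(%d)' % size  # expect read(16384) on every call
            return self._fileobj.read(size)

    with open('foo.xml', 'rb') as f:
        for event, elem in etree.iterparse(LoggingFile(f)):
            break  # one element is enough to see the chunk size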
If you want to control the buffer size, use the buffering keyword argument:
    open('foo.xml', buffering=(2 << 16) + 8)  # buffer enough for 8 full parser reads
which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article, increasing the read buffer should help; using a size at least 4 times the expected read block size plus 8 bytes will improve read performance. In the example above, I've set it to 8 times the ElementTree read size.
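Putting the larger buffer together with iterparse might look like this (a sketch under assumptions: 'record' and process_record() stand in for your real tag and handler):

    import xml.etree.cElementTree as etree

    def process_record(elem):
        pass  # placeholder for the real per-element work

    # Room for 8 full 16 kb parser reads, plus the 8 bytes of slack
    # mentioned above.
    with open('foo.xml', 'rb', buffering=(2 << 16) + 8) as source:
        for event, elem in etree.iterparse(source):
            if elem.tag == 'record':
                process_record(elem)
                elem.clear()  # drop the element so memory stays bounded

Clearing processed elements matters at this scale; otherwise the tree that iterparse builds behind the scenes grows without bound.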
The io.open() function represents the new Python 3 I/O structure, where I/O has been split up into a new hierarchy of class types to give you more flexibility. The price is more indirection: the data has to travel through more layers, and the Python C code does more work itself instead of leaving that to the OS.
You could try and see if io.open('foo.xml', 'rb', buffering=2 << 16) is going to perform any better. Opening in rb mode will give you an io.BufferedReader instance.
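A quick way to confirm the type you get back (illustrative):

    import io

    source = io.open('foo.xml', 'rb', buffering=2 << 16)
    print isinstance(source, io.BufferedReader)  # True
    source.close()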
You do not want to use io.TextIOWrapper; the underlying expat parser wants raw data, as it will decode your XML file's encoding itself. The wrapper would only add extra overhead; you get this type if you open in r (text mode) instead.
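For contrast, text mode hands you exactly the wrapper to avoid here (again illustrative):

    import io

    text_source = io.open('foo.xml', 'r')  # text mode
    print isinstance(text_source, io.TextIOWrapper)  # True: decodes to unicode
    text_source.close()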
Using io.open() may give you more flexibility and a richer API, but the underlying C file object is opened using open() instead of fopen(), and all buffering is handled by the Python io.BufferedIOBase implementation.
Your problem will be processing this beast, not the file reads, I think. The disk cache will be pretty much shot anyway when reading an 800 GB file.