Efficient reading of 800 GB XML file in Python 2.7

Problem description

I am reading an 800 GB xml file in python 2.7 and parsing it with an etree iterative parser.

Currently, I am just using open('foo.txt') with no buffering argument. I am a little confused whether this is the approach I should take or I should use a buffering argument or use something from io like io.BufferedReader or io.open or io.TextIOBase.

A point in the right direction would be much appreciated.

Recommended answer

The standard open() function already, by default, returns a buffered file (if available on your platform). For file objects that is usually fully buffered.

Usually here means that Python leaves this to the C stdlib implementation; it uses a fopen() call (wfopen() on Windows to support UTF-16 filenames), which means that the default buffering for a file is chosen; on Linux I believe that would be 8kb. For a pure-read operation like XML parsing this type of buffering is exactly what you want.
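
If you want to see what block size the filesystem actually reports for your file, a quick sketch (Unix only; the st_blksize attribute is not available on Windows, and 'foo.xml' is just the placeholder name from this answer):

import os

# Preferred I/O block size reported by the filesystem for this file
# (Unix-only attribute of the stat result).
print(os.stat('foo.xml').st_blksize)    # e.g. 4096 on many Linux filesystems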

The XML parsing done by iterparse reads the file in chunks of 16384 bytes (16kb).
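
As a rough illustration of how those chunked reads are usually consumed, here is a minimal streaming-parse sketch; the cElementTree module is assumed as the fast parser in 2.7, and the 'record' tag name is a placeholder, not something from the question:

from xml.etree import cElementTree as etree   # C-accelerated ElementTree in Python 2.7

# Stream the document and clear processed elements so memory stays bounded
# even for an 800 GB input.
context = iter(etree.iterparse('foo.xml', events=('start', 'end')))
event, root = next(context)                   # grab the root element first

for event, elem in context:
    if event == 'end' and elem.tag == 'record':   # placeholder tag name
        # ... process elem here ...
        root.clear()                          # drop processed children of the root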

If you want to control the buffersize, use the buffering keyword argument:

open('foo.xml', buffering=(2<<16) + 8)  # buffer enough for 8 full parser reads

which will override the default buffer size (which I'd expect to match the file block size or a multiple thereof). According to this article increasing the read buffer should help, and using a size at least 4 times the expected read block size plus 8 bytes is going to improve read performance. In the above example I've set it to 8 times the ElementTree read size.
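
If you want the tuned buffer and the parser to work together, iterparse also accepts an already-open file object instead of a filename; a sketch along those lines (file name and processing are placeholders):

from xml.etree import cElementTree as etree

# 8 full parser reads of 16384 bytes, plus 8 bytes, as suggested above.
BUF = (2 << 16) + 8
with open('foo.xml', 'rb', BUF) as f:     # buffering as the third positional argument
    for event, elem in etree.iterparse(f):
        pass                              # ... process and clear elem as needed ...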

The io.open() function represents the new Python 3 I/O structure of objects, where I/O has been split up into a new hierarchy of class types to give you more flexibility. The price is more indirection, more layers for the data to have to travel through, and the Python C code does more work itself instead of leaving that to the OS.

You could try and see if io.open('foo.xml', 'rb', buffering=2<<16) is going to perform any better. Opening in rb mode will give you an io.BufferedReader instance.
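
A small sketch of that variant, just to show what you get back (again using the placeholder file name):

import io
from xml.etree import cElementTree as etree

# Binary mode with a positive buffering value gives an io.BufferedReader,
# which iterparse can consume like any other file object.
f = io.open('foo.xml', 'rb', buffering=2 << 16)
print(isinstance(f, io.BufferedReader))   # True
for event, elem in etree.iterparse(f):
    pass                                  # ... process and clear elem as needed ...
f.close()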

You do not want to use io.TextIOWrapper; the underlying expat parser wants raw data as it'll decode your XML file encoding itself. It would only add extra overhead; you get this type if you open in r (textmode) instead.

Using io.open() may give you more flexibility and a richer API, but the underlying C file object is opened using open() instead of fopen(), and all buffering is handled by the Python io.BufferedIOBase implementation.

Your problem will be processing this beast, not the file reads, I think. The disk cache will be pretty much shot anyway when reading an 800GB file.
