Fast data move from file to some StringIO


Problem description


In Python I have a file stream, and I want to copy part of it into a StringIO. I want this to be as fast as possible, with minimal copying.

But if I do:

data = file.read(SIZE)
stream = StringIO(data)

I think 2 copies are made, no? One copy from the file into data, another inside StringIO into its internal buffer. Can I avoid one of the copies? I don't need the temporary data, so I think one copy should be enough.

Solution

In short: you can't avoid 2 copies using StringIO.

Some assumptions:

  • You're using cStringIO, otherwise it would be silly to optimize this much.
  • It's speed and not memory efficiency you're after. If not, see Jakob Bowyer's solution, or use a variant with file.read(SOME_BYTE_COUNT) if your file is binary.
  • You've already stated this in the comments, but for completeness: you want to actually edit the contents, not just view it.
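As an illustration of the chunked-read variant mentioned in the second bullet (the chunk size and the callback structure here are my own assumptions, not part of the original answer), reading a binary file piece by piece keeps peak memory at one chunk:

```python
# Read a binary file in fixed-size chunks instead of one big read().
# CHUNK and the handle_chunk callback are illustrative choices.
CHUNK = 64 * 1024  # 64 KiB per read

def process_in_chunks(path, handle_chunk):
    """Call handle_chunk(data) for each successive chunk of the file."""
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK)
            if not data:      # empty bytes means end of file
                break
            handle_chunk(data)
```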

Long answer: Since Python strings are immutable and the StringIO buffer is not, a copy will have to be made sooner or later; otherwise you'd be altering an immutable object! For what you want to be possible, the StringIO object would need a dedicated method that reads directly from a file object given as an argument. There is no such method.

Outside of StringIO, there are solutions that avoid the extra copy. Off the top of my head, this will read a file directly into a modifiable byte array, no extra copy:

import numpy as np
a = np.fromfile("filename.ext", dtype="uint8")

It may be cumbersome to work with, depending on the usage you intend, since it's an array of values from 0 to 255, not an array of characters. But it's functionally equivalent to a StringIO object, and using np.fromstring, np.tostring, np.tofile and slicing notation should get you where you want. You might also need np.insert, np.delete and np.append.
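For example, here is a small sketch of that byte-array workflow. Note it uses the modern spellings np.frombuffer and ndarray.tobytes rather than the older np.fromstring/np.tostring named above, and the sample bytes are made up for illustration:

```python
import numpy as np

data = b"hello world"
# frombuffer gives a read-only view of the bytes; copy() makes it writable.
a = np.frombuffer(data, dtype="uint8").copy()
a[0] = ord("H")                                   # edit a single byte in place
a = np.append(a, np.frombuffer(b"!", dtype="uint8"))  # grow the array
print(a.tobytes())                                # b'Hello world!'
```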

I'm sure there are other modules that will do similar things.
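In fact, even the standard library can fill a mutable buffer with a single copy: file.readinto writes straight into a preallocated bytearray. A sketch (the helper name and sizing the buffer via os.path.getsize are my own assumptions):

```python
import os

def read_into_bytearray(path):
    """Read a whole binary file into a mutable bytearray with one copy."""
    size = os.path.getsize(path)
    buf = bytearray(size)          # preallocated, mutable buffer
    with open(path, "rb") as f:
        n = f.readinto(buf)        # fills buf in place, returns bytes read
    assert n == size
    return buf
```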

TIMEIT:

How much does all this really matter? Well, let's see. I've made a 100MB file, largefile.bin. Then I read in the file using both methods and change the first byte.

$ python -m timeit -s "import numpy as np" "a = np.fromfile('largefile.bin', 'uint8'); a[0] = 1"
10 loops, best of 3: 132 msec per loop
$ python -m timeit -s "from cStringIO import StringIO" "a = StringIO(); a.write(open('largefile.bin').read()); a.seek(0); a.write('1')"
10 loops, best of 3: 203 msec per loop

So in my case, using StringIO is 50% slower than using numpy.

Lastly, for comparison, editing the file directly:

$ python -m timeit "a = open('largefile.bin', 'r+b'); a.seek(0); a.write('1')"
10000 loops, best of 3: 29.5 usec per loop

So, it's nearly 4500 times faster. Of course, this is extremely dependent on what you're going to do with the file. Altering the first byte is hardly representative. But using this method you do have a head start on the other two, and since most OSes buffer disk access well, the speed may be very good too.
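A related stdlib route, not mentioned in the original answer: mmap maps the file into memory, so edits to the buffer are edits to the file, combining the mutability of the byte array with the zero-copy behavior of direct editing. A sketch:

```python
import mmap

def set_first_byte(path, value):
    """Change the first byte of a file in place via a memory map."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:   # map the whole file
            m[0] = value                      # written through to the file
```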

(If you're not allowed to edit the file and so want to avoid the cost of making a working copy, there are a couple of possible ways to increase the speed. If you can choose the filesystem, Btrfs has a copy-on-write file copy operation -- making the act of taking a copy of a file virtually instant. The same effect can be achieved using an LVM snapshot of any filesystem.)
