tempfile.TemporaryFile vs. StringIO
Question
I've written a small benchmark that compares different string concatenation methods for ZOCache.
It looks like tempfile.TemporaryFile is faster than anything else:
$ python src/ZOCache/tmp_benchmark.py
3.00407409668e-05 TemporaryFile
0.385630846024 SpooledTemporaryFile
0.299962997437 BufferedRandom
0.0849719047546 io.StringIO
0.113346099854 concat
The benchmark code I've been using:
#!/usr/bin/python
from __future__ import print_function
import io
import timeit
import tempfile

class Error(Exception):
    pass

def bench_temporaryfile():
    with tempfile.TemporaryFile(bufsize=10*1024*1024) as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(i))
            out.write(b" ")
        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error

def bench_spooledtemporaryfile():
    with tempfile.SpooledTemporaryFile(max_size=10*1024*1024) as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(i))
            out.write(b" ")
        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error

def bench_BufferedRandom():
    # 1. BufferedRandom
    with io.open('out.bin', mode='w+b') as fp:
        with io.BufferedRandom(fp, buffer_size=10*1024*1024) as out:
            for i in range(0, 100):
                out.write(b"Value = ")
                out.write(bytes(i))
                out.write(b" ")
            # Get string.
            out.seek(0)
            contents = out.read()
            # Test first letter.
            if contents[0:5] != b'Value':
                raise Error

def bench_stringIO():
    # 1. Use StringIO.
    out = io.StringIO()
    for i in range(0, 100):
        out.write(u"Value = ")
        out.write(unicode(i))
        out.write(u" ")
    # Get string.
    contents = out.getvalue()
    out.close()
    # Test first letter.
    if contents[0] != 'V':
        raise Error

def bench_concat():
    # 2. Use string appends.
    data = ""
    for i in range(0, 100):
        data += u"Value = "
        data += unicode(i)
        data += u" "
    # Test first letter.
    if data[0] != u'V':
        raise Error

if __name__ == '__main__':
    print(str(timeit.timeit('bench_temporaryfile()', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")
    print(str(timeit.timeit('bench_spooledtemporaryfile()', setup="from __main__ import bench_spooledtemporaryfile", number=1000)) + " SpooledTemporaryFile")
    print(str(timeit.timeit('bench_BufferedRandom()', setup="from __main__ import bench_BufferedRandom", number=1000)) + " BufferedRandom")
    print(str(timeit.timeit("bench_stringIO()", setup="from __main__ import bench_stringIO", number=1000)) + " io.StringIO")
    print(str(timeit.timeit("bench_concat()", setup="from __main__ import bench_concat", number=1000)) + " concat")
Edit: Python 3.4.3 + io.BytesIO
python3 ./src/ZOCache/tmp_benchmark.py
2.689500024644076e-05 TemporaryFile
0.30429405899985795 SpooledTemporaryFile
0.348170792000019 BufferedRandom
0.0764778530001422 io.BytesIO
0.05162201000030109 concat
New source with io.BytesIO:
#!/usr/bin/python3
from __future__ import print_function
import io
import timeit
import tempfile

class Error(Exception):
    pass

def bench_temporaryfile():
    with tempfile.TemporaryFile() as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(str(i), 'utf-8'))
            out.write(b" ")
        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error

def bench_spooledtemporaryfile():
    with tempfile.SpooledTemporaryFile(max_size=10*1024*1024) as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(str(i), 'utf-8'))
            out.write(b" ")
        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error

def bench_BufferedRandom():
    # 1. BufferedRandom
    with io.open('out.bin', mode='w+b') as fp:
        with io.BufferedRandom(fp, buffer_size=10*1024*1024) as out:
            for i in range(0, 100):
                out.write(b"Value = ")
                out.write(bytes(i))
                out.write(b" ")
            # Get string.
            out.seek(0)
            contents = out.read()
            # Test first letter.
            if contents[0:5] != b'Value':
                raise Error

def bench_BytesIO():
    # 1. Use StringIO.
    out = io.BytesIO()
    for i in range(0, 100):
        out.write(b"Value = ")
        out.write(bytes(str(i), 'utf-8'))
        out.write(b" ")
    # Get string.
    contents = out.getvalue()
    out.close()
    # Test first letter.
    if contents[0:5] != b'Value':
        raise Error

def bench_concat():
    # 2. Use string appends.
    data = ""
    for i in range(0, 100):
        data += "Value = "
        data += str(i)
        data += " "
    # Test first letter.
    if data[0] != 'V':
        raise Error

if __name__ == '__main__':
    print(str(timeit.timeit('bench_temporaryfile()', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")
    print(str(timeit.timeit('bench_spooledtemporaryfile()', setup="from __main__ import bench_spooledtemporaryfile", number=1000)) + " SpooledTemporaryFile")
    print(str(timeit.timeit('bench_BufferedRandom()', setup="from __main__ import bench_BufferedRandom", number=1000)) + " BufferedRandom")
    print(str(timeit.timeit("bench_BytesIO()", setup="from __main__ import bench_BytesIO", number=1000)) + " io.BytesIO")
    print(str(timeit.timeit("bench_concat()", setup="from __main__ import bench_concat", number=1000)) + " concat")
Is that true for every platform? And if so, why?
Results with the fixed benchmark (and fixed code):
0.2675984420002351 TemporaryFile
0.28104681999866443 SpooledTemporaryFile
0.3555715570000757 BufferedRandom
0.10379689100045653 io.BytesIO
0.05650951399911719 concat
Recommended answer
Your biggest problem: Per tdelaney, you never actually ran the TemporaryFile test; you omitted the parens in the timeit snippet (and only for that test; the others actually ran). So you were timing how long it takes to look up the name bench_temporaryfile, not how long it takes to call it. Change:
print(str(timeit.timeit('bench_temporaryfile', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")
to:
print(str(timeit.timeit('bench_temporaryfile()', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")
(adding parens to make it a call) to fix.
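The effect of the missing parentheses can be reproduced in isolation. The following is a minimal sketch for Python 3.5+ (it relies on the globals argument of timeit.timeit); work is a hypothetical stand-in for bench_temporaryfile, not part of the original benchmark.

```python
import timeit

def work():
    # Hypothetical stand-in for bench_temporaryfile(): does a small but
    # measurable amount of work on every call.
    return sum(range(1000))

# Without parentheses the statement only evaluates the name; work is never called.
lookup_time = timeit.timeit("work", globals=globals(), number=1000)

# With parentheses the function is actually called on every iteration.
call_time = timeit.timeit("work()", globals=globals(), number=1000)

print(lookup_time, "name lookup only")
print(call_time, "actual call")
```

The first number should come out orders of magnitude smaller than the second, which is the same pattern as the 3e-05 reading for TemporaryFile in the question.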
Some other issues:
io.StringIO is fundamentally different from your other test cases. Specifically, all the other types you're testing operate in binary mode, reading and writing str and avoiding line-ending conversions. io.StringIO uses Python 3 style strings (unicode in Python 2), which your tests acknowledge by using different literals and converting to unicode instead of bytes. This adds a lot of encoding and decoding overhead and uses a lot more memory (unicode uses 2-4x the memory of str for the same data, which means more allocator overhead, more copy overhead, etc.).
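As a small illustration of the text/binary split (a sketch in Python 3, where the distinction is enforced strictly; the variable names are made up for this example), io.StringIO only accepts text and io.BytesIO only accepts bytes:

```python
import io

text_buf = io.StringIO()
text_buf.write("Value = 1 ")      # text (str in Python 3, unicode in Python 2)

byte_buf = io.BytesIO()
byte_buf.write(b"Value = 1 ")     # raw bytes, like the binary-mode file objects

# Mixing the two kinds of data is rejected with a TypeError in Python 3.
try:
    text_buf.write(b"Value = 2 ")
    text_rejects_bytes = False
except TypeError:
    text_rejects_bytes = True

try:
    byte_buf.write("Value = 2 ")
    bytes_rejects_text = False
except TypeError:
    bytes_rejects_text = True

print(text_rejects_bytes, bytes_rejects_text)
```

This is why the io.StringIO test case had to use different literals from the others, and why it is not measuring the same work.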
The other major difference is that you're setting a truly huge bufsize for TemporaryFile; few system calls need to occur, and most writes just append to contiguous memory in the buffer. By contrast, io.StringIO stores the individual values written and only joins them together when you ask for them with getvalue().
Lastly, you think you're being forward compatible by using the bytes constructor, but you're not. In Python 2, bytes is an alias for str, so bytes(10) returns '10'. In Python 3, bytes is a totally different thing, and passing an integer to it returns a zero-initialized bytes object of that size: bytes(10) returns b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.
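The difference is easy to check on Python 3. The sketch below also shows one way to get the decimal digits as bytes, by formatting the integer as text and encoding it; this is the same idea as the bytes(str(i), 'utf-8') call in the question's updated code, and the variable names are only illustrative.

```python
i = 10

# Python 3: an integer argument means "this many zero bytes".
zero_filled = bytes(i)

# Format the number as text first, then encode it to get its digits as bytes.
as_digits = str(i).encode("ascii")

print(zero_filled)
print(as_digits)
```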
If you want a fair test case, at the very least switch to cStringIO.StringIO or io.BytesIO instead of io.StringIO and write bytes uniformly. Typically, you wouldn't explicitly set the buffer size for TemporaryFile and the like yourself, so you might consider dropping that.
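One possible shape for such a like-for-like comparison is sketched below for Python 3.5+. The helper write_values and the function names are assumptions made for this example, not the original benchmark; both cases write identical bytes and use default buffer sizes.

```python
import io
import tempfile
import timeit

def write_values(out):
    # Write the same byte sequence to any binary file-like object.
    for i in range(100):
        out.write(b"Value = ")
        out.write(str(i).encode("ascii"))
        out.write(b" ")

def bench_bytesio():
    # Purely in-memory binary buffer.
    with io.BytesIO() as out:
        write_values(out)
        return out.getvalue()

def bench_temporaryfile():
    # Default buffering; the data may or may not reach a real file system.
    with tempfile.TemporaryFile() as out:
        write_values(out)
        out.seek(0)
        return out.read()

if __name__ == "__main__":
    for name in ("bench_bytesio", "bench_temporaryfile"):
        elapsed = timeit.timeit(name + "()", globals=globals(), number=1000)
        print(elapsed, name)
```

Because both functions return the bytes they produced, it is straightforward to confirm that they do the same work before comparing their timings.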
In my own tests on Linux x64 with Python 2.7.10, using ipython's %timeit magic, the ranking is:
- io.BytesIO: ~48 μs per loop
- io.StringIO: ~54 μs per loop (so unicode overhead didn't add much)
- cStringIO.StringIO: ~83 μs per loop
- TemporaryFile: ~2.8 ms per loop (note units; ms is 1000x longer than μs)
And that's without going back to default buffer sizes (I kept the explicit bufsize from your tests). I suspect the behavior of TemporaryFile will vary a lot more, depending on the OS and how temporary files are handled: some systems might just store them in memory, others might store them in /tmp, but of course /tmp might just be a RAM disk anyway.
Something tells me you may have a setup where the TemporaryFile is basically a plain memory buffer that never goes to the file system, whereas mine may ultimately end up on persistent storage (if only for short periods). Stuff happening in memory is predictable, but when you involve the file system (which TemporaryFile can, depending on OS, kernel settings, etc.), the behavior will differ a great deal between systems.
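To get a rough feel for how a particular machine behaves, one option is to check which directory tempfile uses and time a single round trip. This is only a sketch (roundtrip is a hypothetical helper), and it cannot tell whether that directory is backed by RAM or by disk.

```python
import tempfile
import timeit

# Directory tempfile uses by default on this machine.
tmp_dir = tempfile.gettempdir()
print("temporary files are created under:", tmp_dir)

def roundtrip():
    # Create, write, read back, and discard one temporary file.
    with tempfile.TemporaryFile() as out:
        out.write(b"Value = 0 ")
        out.seek(0)
        return out.read()

runs = 100
elapsed = timeit.timeit(roundtrip, number=runs)
print(elapsed / runs, "seconds per TemporaryFile round trip")
```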