tempfile.TemporaryFile vs. StringIO


Question

I've written a little benchmark where I compare different string concatenation methods for ZOCache.

So it looks here like tempfile.TemporaryFile is faster than anything else:

$ python src/ZOCache/tmp_benchmark.py 
3.00407409668e-05 TemporaryFile
0.385630846024 SpooledTemporaryFile
0.299962997437 BufferedRandom
0.0849719047546 io.StringIO
0.113346099854 concat

The benchmark code I've been using:

#!/usr/bin/python
from __future__ import print_function
import io
import timeit
import tempfile


class Error(Exception):
    pass


def bench_temporaryfile():
    with tempfile.TemporaryFile(bufsize=10*1024*1024) as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(i))
            out.write(b" ")

        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error


def bench_spooledtemporaryfile():
    with tempfile.SpooledTemporaryFile(max_size=10*1024*1024) as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(i))
            out.write(b" ")

        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error


def bench_BufferedRandom():
    # 1. BufferedRandom
    with io.open('out.bin', mode='w+b') as fp:
        with io.BufferedRandom(fp, buffer_size=10*1024*1024) as out:
            for i in range(0, 100):
                out.write(b"Value = ")
                out.write(bytes(i))
                out.write(b" ")

            # Get string.
            out.seek(0)
            contents = out.read()
            # Test first letter.
            if contents[0:5] != b'Value':
                raise Error


def bench_stringIO():
    # 1. Use StringIO.
    out = io.StringIO()
    for i in range(0, 100):
        out.write(u"Value = ")
        out.write(unicode(i))
        out.write(u" ")

    # Get string.
    contents = out.getvalue()
    out.close()
    # Test first letter.
    if contents[0] != 'V':
        raise Error


def bench_concat():
    # 2. Use string appends.
    data = ""
    for i in range(0, 100):
        data += u"Value = "
        data += unicode(i)
        data += u" "
    # Test first letter.
    if data[0] != u'V':
        raise Error


if __name__ == '__main__':
    print(str(timeit.timeit('bench_temporaryfile()', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")
    print(str(timeit.timeit('bench_spooledtemporaryfile()', setup="from __main__ import bench_spooledtemporaryfile", number=1000)) + " SpooledTemporaryFile")
    print(str(timeit.timeit('bench_BufferedRandom()', setup="from __main__ import bench_BufferedRandom", number=1000)) + " BufferedRandom")
    print(str(timeit.timeit("bench_stringIO()", setup="from __main__ import bench_stringIO", number=1000)) + " io.StringIO")
    print(str(timeit.timeit("bench_concat()", setup="from __main__ import bench_concat", number=1000)) + " concat")

Edit: Python 3.4.3 + io.BytesIO

python3 ./src/ZOCache/tmp_benchmark.py 
2.689500024644076e-05 TemporaryFile
0.30429405899985795 SpooledTemporaryFile
0.348170792000019 BufferedRandom
0.0764778530001422 io.BytesIO
0.05162201000030109 concat

New source with io.BytesIO:

#!/usr/bin/python3
from __future__ import print_function
import io
import timeit
import tempfile


class Error(Exception):
    pass


def bench_temporaryfile():
    with tempfile.TemporaryFile() as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(str(i), 'utf-8'))
            out.write(b" ")

        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error


def bench_spooledtemporaryfile():
    with tempfile.SpooledTemporaryFile(max_size=10*1024*1024) as out:
        for i in range(0, 100):
            out.write(b"Value = ")
            out.write(bytes(str(i), 'utf-8'))
            out.write(b" ")

        # Get string.
        out.seek(0)
        contents = out.read()
        out.close()
        # Test first letter.
        if contents[0:5] != b"Value":
            raise Error


def bench_BufferedRandom():
    # 1. BufferedRandom
    with io.open('out.bin', mode='w+b') as fp:
        with io.BufferedRandom(fp, buffer_size=10*1024*1024) as out:
            for i in range(0, 100):
                out.write(b"Value = ")
                out.write(bytes(i))
                out.write(b" ")

            # Get string.
            out.seek(0)
            contents = out.read()
            # Test first letter.
            if contents[0:5] != b'Value':
                raise Error


def bench_BytesIO():
    # 1. Use BytesIO.
    out = io.BytesIO()
    for i in range(0, 100):
        out.write(b"Value = ")
        out.write(bytes(str(i), 'utf-8'))
        out.write(b" ")

    # Get string.
    contents = out.getvalue()
    out.close()
    # Test first letter.
    if contents[0:5] != b'Value':
        raise Error


def bench_concat():
    # 2. Use string appends.
    data = ""
    for i in range(0, 100):
        data += "Value = "
        data += str(i)
        data += " "
    # Test first letter.
    if data[0] != 'V':
        raise Error


if __name__ == '__main__':
    print(str(timeit.timeit('bench_temporaryfile()', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")
    print(str(timeit.timeit('bench_spooledtemporaryfile()', setup="from __main__ import bench_spooledtemporaryfile", number=1000)) + " SpooledTemporaryFile")
    print(str(timeit.timeit('bench_BufferedRandom()', setup="from __main__ import bench_BufferedRandom", number=1000)) + " BufferedRandom")
    print(str(timeit.timeit("bench_BytesIO()", setup="from __main__ import bench_BytesIO", number=1000)) + " io.BytesIO")
    print(str(timeit.timeit("bench_concat()", setup="from __main__ import bench_concat", number=1000)) + " concat")

Is that true for every platform? And if so, why?

Results with the fixed benchmark (and fixed code):

0.2675984420002351 TemporaryFile
0.28104681999866443 SpooledTemporaryFile
0.3555715570000757 BufferedRandom
0.10379689100045653 io.BytesIO
0.05650951399911719 concat

Recommended answer

Your biggest problem: per tdelaney, you never actually ran the TemporaryFile test; you omitted the parens in the timeit snippet (and only for that test; the others actually ran). So you were timing how long it takes to look up the name bench_temporaryfile, not how long it takes to call it. Change:

print(str(timeit.timeit('bench_temporaryfile', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")

to:

print(str(timeit.timeit('bench_temporaryfile()', setup="from __main__ import bench_temporaryfile", number=1000)) + " TemporaryFile")

(adding parens to make it a call) to fix.
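The effect of the missing parens can be checked directly. The following is only a minimal sketch: CountingTask, task, and namespace are hypothetical names introduced for illustration, and it assumes Python 3.5+ for the globals parameter of timeit.timeit. It counts how often the callable actually runs under each statement.

```python
import timeit


class CountingTask:
    """Hypothetical stand-in for a benchmark function; counts its own calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return sum(range(1000))


task = CountingTask()
namespace = {"task": task}

# Statement is just the name: timeit only evaluates the name lookup.
lookup_seconds = timeit.timeit("task", globals=namespace, number=1000)
calls_after_lookup = task.calls

# Statement includes parens: timeit calls the object on every iteration.
call_seconds = timeit.timeit("task()", globals=namespace, number=1000)
calls_after_call = task.calls

print("calls after timing 'task':  ", calls_after_lookup)
print("calls after timing 'task()':", calls_after_call)
```

The call counter stays at zero for the first statement, which is why a timing on the order of microseconds for 1000 iterations is a hint that nothing was executed.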

Some other issues:

io.StringIO is fundamentally different from your other test cases. Specifically, all the other types you're testing operate in binary mode, reading and writing str, and avoiding line-ending conversions. io.StringIO uses Python 3 style strings (unicode in Python 2), which your tests acknowledge by using different literals and converting to unicode instead of bytes. This adds a lot of encoding and decoding overhead, and it uses a lot more memory (unicode uses 2-4x the memory of str for the same data, which means more allocator overhead, more copy overhead, etc.).

The other major difference is that you're setting a truly huge bufsize for TemporaryFile; few system calls would need to occur, and most writes just append to contiguous memory in the buffer. By contrast, io.StringIO stores the individual values written and only joins them together when you ask for them with getvalue().

Also, lastly, you think you're being forward compatible by using the bytes constructor, but you're not. In Python 2, bytes is an alias for str, so bytes(10) returns '10'. In Python 3, bytes is a totally different type, and passing it an integer returns a zero-initialized bytes object of that size, so bytes(10) returns b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.
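A short illustration of that difference, assuming Python 3 (the variable names are made up for this sketch). Formatting the number as text and then encoding it is one way to get the digits as bytes; it is shown here as an option, not as what the original benchmark did.

```python
number = 10

# Python 3: an integer argument means "this many zero bytes".
zero_filled = bytes(number)

# Format as text first, then encode; this yields the ASCII digits.
as_digits = str(number).encode("ascii")

print(repr(zero_filled))
print(repr(as_digits))
```

This is the same idea as the bytes(str(i), 'utf-8') calls in the Python 3 version of the benchmark, and it explains why the bench_BufferedRandom function there, which still calls bytes(i), writes runs of zero bytes rather than digits.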

If you want a fair test case, at the very least switch to cStringIO.StringIO or io.BytesIO instead of io.StringIO and write bytes uniformly. Typically, you wouldn't explicitly set the buffer size for TemporaryFile and the like yourself, so you might consider dropping that.
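A fairer comparison along those lines might look like the sketch below. It assumes Python 3; the helper names (pieces, build_bytesio, build_temporaryfile, build_join) are hypothetical, and the b"".join baseline is an addition for reference rather than something from the original benchmark. Every candidate writes the same bytes chunks, and TemporaryFile uses its default buffering.

```python
import io
import tempfile
import timeit


def pieces():
    """Yield the same bytes chunks for every candidate."""
    for i in range(100):
        yield b"Value = "
        yield str(i).encode("ascii")
        yield b" "


def build_bytesio():
    out = io.BytesIO()
    for chunk in pieces():
        out.write(chunk)
    return out.getvalue()


def build_temporaryfile():
    # Default buffering: no explicit buffer size is passed.
    with tempfile.TemporaryFile() as out:
        for chunk in pieces():
            out.write(chunk)
        out.seek(0)
        return out.read()


def build_join():
    # Baseline added for reference: collect chunks and join once.
    return b"".join(pieces())


if __name__ == "__main__":
    candidates = [
        ("io.BytesIO", build_bytesio),
        ("TemporaryFile", build_temporaryfile),
        ("bytes join", build_join),
    ]
    for name, func in candidates:
        # Passing the callable itself avoids the missing-parens mistake.
        seconds = timeit.timeit(func, number=1000)
        print("{:.6f} {}".format(seconds, name))
```

Because all three functions return identical bytes, any timing difference comes from the container rather than from encoding work or differing payloads.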

In my own tests on Linux x64 with Python 2.7.10, using IPython's %timeit magic, the ranking is:

  1. io.BytesIO ~48 μs per loop
  2. io.StringIO ~54 μs per loop (so unicode overhead didn't add much)
  3. cStringIO.StringIO ~83 μs per loop
  4. TemporaryFile ~2.8 ms per loop (note units; ms is 1000x longer than μs)

And that's without going back to default buffer sizes (I kept the explicit bufsize from your tests). I suspect the behavior of TemporaryFile will vary a lot more (depending on the OS and how temporary files are handled; some systems might just store them in memory, others might store them in /tmp, but of course /tmp might just be a RAM disk anyway).

Something tells me you may have a setup where the TemporaryFile is basically a plain memory buffer that never goes to the file system, whereas mine may ultimately end up on persistent storage (if only for short periods). Stuff happening in memory is predictable, but when you involve the file system (which TemporaryFile can, depending on OS, kernel settings, etc.), the behavior will differ a great deal between systems.
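To see where temporary files land on a given machine, and to contrast that with a buffer that is explicitly kept in memory, something like the following sketch may help. It assumes Python 3 and uses only documented tempfile calls (gettempdir, SpooledTemporaryFile, rollover); the variable names are illustrative. It does not tell you whether the reported directory is RAM-backed, which has to be checked with OS-specific tools.

```python
import tempfile

# Directory that tempfile uses by default on this machine.
temp_dir = tempfile.gettempdir()
print("temporary files are created under:", temp_dir)

payload = b"Value = 0 "

# SpooledTemporaryFile buffers in memory until max_size is exceeded
# or rollover() is called explicitly.
with tempfile.SpooledTemporaryFile(max_size=1024 * 1024) as spooled:
    spooled.write(payload)
    spooled.seek(0)
    in_memory_copy = spooled.read()

    # Force the data onto a real temporary file, then read it back.
    spooled.rollover()
    spooled.seek(0)
    on_disk_copy = spooled.read()

print(in_memory_copy == on_disk_copy)
```

The data is the same either way; what changes between systems is the cost of the file-backed path, which is the part of the benchmark that varies most.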
