Confused about StringIO, cStringIO and BytesIO
Question
I have googled and also searched on SO for the differences between these buffer modules. However, I still don't understand them very well, and I think some of the posts I read are out of date.
In Python 2.7.11, I downloaded a binary file of a specific format using r = requests.get(url). Then I passed StringIO.StringIO(r.content), cStringIO.StringIO(r.content) and io.BytesIO(r.content) to a function designed for parsing the content.
All three approaches work. That is, even though the file is binary, StringIO can still be used. Why?
My other question concerns their efficiency.
In [1]: import StringIO, cStringIO, io
In [2]: from numpy import random
In [3]: x = random.random(1000000)
In [4]: %timeit y = cStringIO.StringIO(x)
1000000 loops, best of 3: 736 ns per loop
In [5]: %timeit y = StringIO.StringIO(x)
1000 loops, best of 3: 283 µs per loop
In [6]: %timeit y = io.BytesIO(x)
1000 loops, best of 3: 1.26 ms per loop
As shown above, in terms of speed: cStringIO > StringIO > BytesIO.
I found someone mentioning that io.BytesIO always makes a new copy, which costs more time. But other posts say that this was fixed in later Python versions.
So, can anyone give a thorough comparison of these IO classes, in both the latest Python 2.x and 3.x?
Some of the references I found:
io.StringIO requires a unicode string. io.BytesIO requires a bytes string. StringIO.StringIO allows either a unicode or a bytes string. cStringIO.StringIO requires a string that is encoded as a bytes string.
But cStringIO.StringIO('abc') doesn't raise any error.
The StringIO class is the wrong class to use for this, especially considering that subunit v2 is binary and not a string.
http://comments.gmane.org/gmane.comp.python.devel/148717
cStringIO.StringIO(b'data') didn't copy the data, while io.BytesIO(b'data') makes a copy (even if the data is not modified later).
That 2014 post includes a patch that fixes this.
- Many SO posts, not listed here.
Here are the Python 2.7 results for Eric's example:
%timeit cStringIO.StringIO(u_data)
1000000 loops, best of 3: 488 ns per loop
%timeit cStringIO.StringIO(b_data)
1000000 loops, best of 3: 448 ns per loop
%timeit StringIO.StringIO(u_data)
1000000 loops, best of 3: 1.15 µs per loop
%timeit StringIO.StringIO(b_data)
1000000 loops, best of 3: 1.19 µs per loop
%timeit io.StringIO(u_data)
1000 loops, best of 3: 304 µs per loop
# %timeit io.StringIO(b_data)
# error
# %timeit io.BytesIO(u_data)
# error
%timeit io.BytesIO(b_data)
10000 loops, best of 3: 77.5 µs per loop
In 2.7, cStringIO.StringIO and StringIO.StringIO are far more efficient than the io classes.
Recommended answer
You should use io.StringIO for handling unicode objects and io.BytesIO for handling bytes objects in both Python 2 and 3, for forward compatibility (these two are all that Python 3 has to offer).
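As a minimal sketch of that recommendation, the snippet below picks the io class from the type of the data. The helper name make_buffer is hypothetical and only for illustration; it is not part of the io module.

```python
import io

def make_buffer(data):
    """Hypothetical helper: wrap data in the io class that matches its type."""
    if isinstance(data, bytes):
        # Byte strings (str on Python 2, bytes on Python 3) go to BytesIO.
        return io.BytesIO(data)
    # Text (unicode on Python 2, str on Python 3) goes to StringIO.
    return io.StringIO(data)

binary_buf = make_buffer(b"\x00\x01\x02")
text_buf = make_buffer(u"\u2603 snowman")

print(type(binary_buf).__name__)
print(type(text_buf).__name__)
```

Because the choice depends only on the type of the input, the same code should behave the same way on Python 2 and 3.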
Here's a better test (for Python 2 and 3) that doesn't include the cost of converting from numpy to str/bytes:
import numpy as np
import string
b_data = np.random.choice(list(string.printable), size=1000000).tobytes()
u_data = b_data.decode('ascii')
u_data = u'\u2603' + u_data[1:] # add a non-ascii character
Then:
import io
%timeit io.StringIO(u_data)
%timeit io.StringIO(b_data)
%timeit io.BytesIO(u_data)
%timeit io.BytesIO(b_data)
In Python 2, you can also test:
import StringIO, cStringIO
%timeit cStringIO.StringIO(u_data)
%timeit cStringIO.StringIO(b_data)
%timeit StringIO.StringIO(u_data)
%timeit StringIO.StringIO(b_data)
Some of these will fail with an error, either about mismatched types or about non-ASCII characters.
Python 3.5 results:
>>> %timeit io.StringIO(u_data)
100 loops, best of 3: 8.61 ms per loop
>>> %timeit io.StringIO(b_data)
TypeError: initial_value must be str or None, not bytes
>>> %timeit io.BytesIO(u_data)
TypeError: a bytes-like object is required, not 'str'
>>> %timeit io.BytesIO(b_data)
The slowest run took 6.79 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 344 ns per loop
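The two TypeErrors in these results can be reproduced outside IPython. The following is a small sketch for Python 3 that catches the error instead of timing the call; try_buffer is a hypothetical helper name introduced only for this example.

```python
import io

def try_buffer(factory, data):
    """Hypothetical helper: return the buffer, or the TypeError message as text."""
    try:
        return factory(data)
    except TypeError as exc:
        return "TypeError: {}".format(exc)

results = {
    "StringIO(text)": try_buffer(io.StringIO, u"abc"),
    "StringIO(bytes)": try_buffer(io.StringIO, b"abc"),
    "BytesIO(text)": try_buffer(io.BytesIO, u"abc"),
    "BytesIO(bytes)": try_buffer(io.BytesIO, b"abc"),
}

# Matching type pairs give a buffer; mismatched pairs give an error message.
for label, outcome in results.items():
    print(label, "->", outcome)
```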
Python 2.7 results (run on a different machine):
>>> %timeit io.StringIO(u_data)
1000 loops, best of 3: 304 µs per loop
>>> %timeit io.StringIO(b_data)
TypeError: initial_value must be unicode or None, not str
>>> %timeit io.BytesIO(u_data)
TypeError: 'unicode' does not have the buffer interface
>>> %timeit io.BytesIO(b_data)
10000 loops, best of 3: 77.5 µs per loop
>>> %timeit cStringIO.StringIO(u_data)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)
>>> %timeit cStringIO.StringIO(b_data)
1000000 loops, best of 3: 448 ns per loop
>>> %timeit StringIO.StringIO(u_data)
1000000 loops, best of 3: 1.15 µs per loop
>>> %timeit StringIO.StringIO(b_data)
1000000 loops, best of 3: 1.19 µs per loop
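Applied to the scenario in the question, r.content from requests is a byte string, so io.BytesIO is the matching buffer for the parser. Below is a rough sketch under stated assumptions: the payload is made up to stand in for r.content (no download is performed), and parse_header with its 8-byte layout is a hypothetical format, not the asker's actual file format.

```python
import io
import struct

# Made-up payload standing in for r.content: a 4-byte magic tag,
# then a little-endian unsigned 32-bit record count.
content = b"DEMO" + struct.pack("<I", 3)

def parse_header(stream):
    """Hypothetical parser: read the magic tag and record count from a binary stream."""
    magic = stream.read(4)
    (count,) = struct.unpack("<I", stream.read(4))
    return magic, count

# BytesIO gives the parser a file-like object over the in-memory bytes.
magic, count = parse_header(io.BytesIO(content))
print(magic, count)
```

The parser only relies on read(), so it would accept a real file opened in binary mode just as well as the in-memory buffer.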