codecs.open(utf-8) fails to read plain ASCII file
Question
I have a plain ASCII file. When I try to open it with codecs.open(..., "utf-8"), I am unable to read single characters. ASCII is a subset of UTF-8, so why can't codecs open such a file in UTF-8 mode?
# test.py
import codecs
f = codecs.open("test.py", "r", "utf-8")
# ASCII is supposed to be a subset of UTF-8:
# http://www.fileformat.info/info/unicode/utf8.htm
assert len(f.read(1)) == 1 # OK
f.readline()
c = f.read(1)
print len(c)
print "'%s'" % c
assert len(c) == 1 # fails
# max% p test.py
# 63
# '
# import codecs
#
# f = codecs.open("test.py", "r", "utf-8")
#
# # ASC'
# Traceback (most recent call last):
# File "test.py", line 15, in <module>
# assert len(c) == 1 # fails
# AssertionError
# max%
System:
Linux max 4.4.0-89-generic #112~14.04.1-Ubuntu SMP Tue Aug 1 22:08:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Of course it works with regular open. It also works if I remove the "utf-8" option. Also, what does 63 mean? That's like the middle of the 3rd line. I don't get it.
Answer
Found the problem:
When passed an encoding, codecs.open returns a StreamReaderWriter, which is really just a wrapper around (not a subclass of; it's a "composed of" relationship, not inheritance) StreamReader and StreamWriter. Problem is:
- StreamReaderWriter provides a "normal" read method (that is, it takes a size parameter and that's it)
- It delegates to the internal StreamReader.read method, where the size argument is only a hint as to the number of bytes to read, but not a limit; the second argument, chars, is a strict limiter, but StreamReaderWriter never passes that argument along (it doesn't accept it)
- When size is hinted, but not capped using chars, if StreamReader has buffered data, and it's large enough to match the size hint, StreamReader.read blindly returns the contents of the buffer, rather than limiting it in any way based on the size hint (after all, only chars imposes a maximum return size)
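The composition and the size-versus-chars behavior described above can be checked directly. A minimal sketch, in Python 3 syntax (the question's code is Python 2, but codecs behaves the same way here); the throwaway temp file is just for illustration:

```python
import codecs
import os
import tempfile

# Throwaway ASCII file to experiment with.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello world\nsecond line\n")
os.close(fd)

f = codecs.open(path, "r", "utf-8")
print(type(f).__name__)                    # StreamReaderWriter
print(isinstance(f, codecs.StreamReader))  # False: composition, not inheritance
print(type(f.reader).__name__)             # StreamReader

# Calling the wrapped reader directly, chars really is a strict cap:
print(len(f.reader.read(100, 1)))          # 1, despite the 100-byte size hint

f.close()
os.remove(path)
```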
The API of StreamReader.read and the meaning of size/chars for that API is the only documented thing here; the fact that codecs.open returns StreamReaderWriter is not contractual, nor is the fact that StreamReaderWriter wraps StreamReader. I just used ipython's ?? magic to read the source code of the codecs module to verify this behavior. But documented or not, that's what it's doing (feel free to read the source code for StreamReaderWriter; it's all Python level, so it's easy).
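If you don't have ipython handy, the stdlib inspect module does the same verification; a short sketch showing that the delegating method accepts only a size hint, with no way to forward a chars limit:

```python
import codecs
import inspect

# The delegating method is tiny; reading it confirms it forwards only
# the size hint to the wrapped StreamReader.
print(inspect.getsource(codecs.StreamReaderWriter.read))

# Its signature has no chars parameter at all.
print(inspect.signature(codecs.StreamReaderWriter.read))  # (self, size=-1)
```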
The best solution is to switch to io.open, which is faster and more correct in every standard case (codecs.open supports the weirdo codecs that don't convert between bytes [Py2 str] and str [Py2 unicode], but rather handle str to str or bytes to bytes encodings, but that's an incredibly limited use case; most of the time, you're converting between bytes and str). All you need to do is import io instead of codecs, and change the codecs.open line to:
f = io.open("test.py", encoding="utf-8")
The rest of your code can remain unchanged (and will likely run faster to boot).
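Put together, the fixed version of the original snippet looks like this (a sketch in Python 3 syntax; a stand-in temp file replaces the original's trick of reading test.py itself):

```python
import io
import os
import tempfile

# Stand-in ASCII file (the original read test.py itself).
fd, path = tempfile.mkstemp()
os.write(fd, b"import codecs\nf = codecs.open('test.py', 'r', 'utf-8')\n")
os.close(fd)

f = io.open(path, encoding="utf-8")
assert len(f.read(1)) == 1   # OK, as before
f.readline()
c = f.read(1)
print(len(c))                # 1: io.open honors the character count exactly
f.close()
os.remove(path)
```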
As an alternative, you could explicitly bypass StreamReaderWriter to get the StreamReader's read method and pass the limiting argument directly, e.g. change:
c = f.read(1)
to:
# Pass second, character limiting argument after size hint
c = f.reader.read(6, 1) # 6 is sort of arbitrary; should ensure a full char read in one go
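Here is that workaround end to end (a sketch; the mixed readline/read pattern mirrors the original repro, and the temp file is a stand-in for the asker's ASCII file):

```python
import codecs
import os
import tempfile

# Stand-in ASCII file for the demo.
fd, path = tempfile.mkstemp()
os.write(fd, b"first line\nsecond line\n")
os.close(fd)

f = codecs.open(path, "r", "utf-8")
f.readline()               # primes StreamReader's internal buffer
c = f.reader.read(6, 1)    # size hint of 6 bytes, strict cap of 1 character
print(len(c))              # 1: chars caps the result even with buffered data
f.close()
os.remove(path)
```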
I suspect Python Bug #8260, which covers intermingling readline and read on codecs.open-created file objects, applies here. Officially, it's "fixed", but if you read the comments, the fix wasn't complete (and may not be possible to complete given the documented API); arbitrarily weird combinations of read and readline will be able to break it.
Again, just use io.open; as long as you're on Python 2.6 or higher, it's available, and it's just plain better.