codecs.open(utf-8) fails to read plain ASCII file


Question

I have a plain ASCII file. When I try to open it with codecs.open(..., "utf-8"), I am unable to read single characters. ASCII is a subset of UTF-8, so why can't codecs open such a file in UTF-8 mode?

# test.py

import codecs

f = codecs.open("test.py", "r", "utf-8")

# ASCII is supposed to be a subset of UTF-8:
# http://www.fileformat.info/info/unicode/utf8.htm

assert len(f.read(1)) == 1 # OK
f.readline()
c = f.read(1)
print len(c)
print "'%s'" % c
assert len(c) == 1 # fails

# max% p test.py
# 63
# '
# import codecs
#
# f = codecs.open("test.py", "r", "utf-8")
#
# # ASC'
# Traceback (most recent call last):
#   File "test.py", line 15, in <module>
#     assert len(c) == 1 # fails
# AssertionError
# max%

System:

Linux max 4.4.0-89-generic #112~14.04.1-Ubuntu SMP Tue Aug 1 22:08:32 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Of course it works with regular open. It also works if I remove the "utf-8" option. Also what does 63 mean? That's like the middle of the 3rd line. I don't get it.

Answer

Figured out the problem:

When passed an encoding, codecs.open returns a StreamReaderWriter, which is really just a wrapper around (not a subclass of; it's a "composed of" relationship, not inheritance) StreamReader and StreamWriter. Problem is:

  1. StreamReaderWriter provides a "normal" read method (that is, it takes a size parameter and that's it)
  2. It delegates to the internal StreamReader.read method, where the size argument is only a hint as to the number of bytes to read, but not a limit; the second argument, chars, is a strict limiter, but StreamReaderWriter never passes that argument along (it doesn't accept it)
  3. When size is hinted but not capped using chars, if StreamReader has buffered data and it's large enough to match the size hint, StreamReader.read blindly returns the contents of the buffer rather than limiting it in any way based on the size hint (after all, only chars imposes a maximum return size)
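The difference between the size hint and the chars cap can be sketched directly against StreamReader. This is a minimal sketch: the sample text and the BytesIO stream are invented for the demo, and whether the uncapped read(1) actually overshoots depends on which version's codecs implementation you run.

```python
import codecs
import io

data = b"first line\nsecond line\nthird line\n"

# readline() reads the stream in chunks, so it buffers past the line
# it returns, leaving the remainder in StreamReader's internal buffer.
reader = codecs.getreader("utf-8")(io.BytesIO(data))
assert reader.readline() == u"first line\n"

# With only a size hint, read(1) may return the whole buffered
# remainder on implementations where size is treated as a hint.
hinted = reader.read(1)
print(len(hinted))  # 1 on some Python versions, much more on others

# With an explicit chars argument, the result is hard-capped.
reader2 = codecs.getreader("utf-8")(io.BytesIO(data))
reader2.readline()
capped = reader2.read(6, 1)  # size hint 6, chars limit 1
assert capped == u"s"        # exactly one character of the second line
```

The second reader shows the guarantee the answer relies on: chars is a strict limiter regardless of how much data is already buffered.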

The API of StreamReader.read and the meaning of size/chars for that API is the only documented thing here; the fact that codecs.open returns StreamReaderWriter is not contractual, nor is the fact that StreamReaderWriter wraps StreamReader; I just used IPython's ?? magic to read the source code of the codecs module to verify this behavior. But documented or not, that's what it's doing (feel free to read the source code for StreamReaderWriter; it's all Python level, so it's easy).

The best solution is to switch to io.open, which is faster and more correct in every standard case (codecs.open supports the weirdo codecs that don't convert between bytes [Py2 str] and str [Py2 unicode], but rather, handle str to str or bytes to bytes encodings, but that's an incredibly limited use case; most of the time, you're converting between bytes and str). All you need to do is import io instead of codecs, and change the codecs.open line to:

f = io.open("test.py", encoding="utf-8")

The rest of your code can remain unchanged (and will likely run faster to boot).
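Putting the fix together, here is the original test script with io.open substituted. A minimal sketch, assuming we write our own throwaway ASCII input file (the sample.txt name and its contents are invented for the demo) rather than reading test.py itself:

```python
import io
import os
import tempfile

# Create a small ASCII file to read back (stand-in for test.py).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"import codecs\n\nf = None\n")

f = io.open(path, encoding="utf-8")
assert len(f.read(1)) == 1  # OK, as before
f.readline()                # finish reading the first line
c = f.read(1)
assert len(c) == 1          # now passes: read(size) is a strict cap
f.close()
```

In text mode, io.open's read(size) returns at most size characters, so mixing readline and read is safe.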

As an alternative, you could explicitly bypass StreamReaderWriter to get the StreamReader's read method and pass the limiting argument directly, e.g. change:

c = f.read(1)

to:

# Pass second, character limiting argument after size hint
c = f.reader.read(6, 1)  # 6 is sort of arbitrary; should ensure a full char read in one go
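The same workaround, end to end, against a codecs.open file object. A minimal sketch; the throwaway file and its contents are invented for the demo:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write(b"first line\nsecond line\n")

f = codecs.open(path, "r", "utf-8")
f.readline()             # may leave the rest of the file buffered
# Bypass StreamReaderWriter and pass chars directly to StreamReader.read:
c = f.reader.read(6, 1)  # size hint 6, hard cap of one character
assert len(c) == 1       # "s", the start of the second line
f.close()
```

This keeps codecs.open in place while restoring a strict per-call character limit, at the cost of relying on the undocumented reader attribute.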

I suspect Python Bug #8260, which covers intermingling readline and read on codecs.open-created file objects, applies here. Officially, it's "fixed", but if you read the comments, the fix wasn't complete (and may not be possible to complete given the documented API); arbitrarily weird combinations of read and readline will be able to break it.

Again, just use io.open; as long as you're on Python 2.6 or higher, it's available, and it's just plain better.
