Python2: Using .decode with errors='replace' still returns errors
Question
So I have a message which is read from a file of unknown encoding, and I want to send it to a webpage for display. I've grappled a lot with UnicodeErrors, have gone through many Q&As on StackOverflow, and think I have a decent understanding of how Unicode and encodings work. My current code looks like this:
try:
    return message.decode(encoding='utf-8')
except:
    try:
        return message.decode(encoding='latin-1')
    except:
        print("Unable to entirely decode in latin or utf-8, will replace error characters with '?'")
        return message.decode(encoding='utf-8', errors="replace")
The returned message is then dumped into JSON and sent to the front end.
I assumed that because I'm using errors="replace" in the last try/except, I was going to avoid exceptions at the expense of having a few '?' characters in my display. An acceptable cost.
However, it seems that I was too hopeful, and for some files I still get a UnicodeDecodeError saying "ascii codecs cannot decode" for some character. Why doesn't errors="replace" just take care of this?
(Also, as a bonus question: what does ASCII have to do with any of this? I'm specifying UTF-8.)
Recommended answer
You should not get a UnicodeDecodeError with errors='replace'. Also, str.decode('latin-1') should never fail, because ISO-8859-1 has a valid character mapping for every possible byte sequence.
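Both claims can be checked with a short sketch. The variable names are illustrative only, and the snippet is written to run on Python 2.7 as well as Python 3. Note that when decoding, errors='replace' substitutes U+FFFD (the Unicode replacement character) rather than a literal '?'.

```python
# Every possible single-byte value, 0x00 through 0xFF.
all_bytes = bytes(bytearray(range(256)))

# latin-1 maps byte N straight to code point N, so this cannot raise.
latin1_text = all_bytes.decode('latin-1')

# b'caf\xe9' is not valid UTF-8 (0xE9 starts a multi-byte sequence that
# never finishes). With 'replace' the bad byte becomes U+FFFD instead of
# raising UnicodeDecodeError.
replaced = b'caf\xe9'.decode('utf-8', 'replace')
```

A consequence is that the third branch of the question's code can never be reached for genuine byte strings: the latin-1 attempt always succeeds.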
My suspicion is that message is already a unicode string, not bytes. Unicode text has already been 'decoded' from bytes and can't be decoded any more.
When you call .decode() on a unicode string, Python 2 tries to be helpful and decides to encode the Unicode string back to bytes (using the default encoding), so that you have something that you can really decode. This implicit encoding step doesn't use errors='replace', so if there are any characters in the Unicode string that aren't in the default encoding (probably ASCII) you'll get a UnicodeEncodeError. That implicit step is also where ASCII comes from, even though you asked for UTF-8.
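The two-step behaviour can be emulated as a rough sketch. The helper py2_style_decode is hypothetical, not part of any library, and it assumes the default encoding is ASCII, which is the usual Python 2 setting. It spells out the hidden encode step explicitly, so it also runs on Python 3.

```python
def py2_style_decode(text, encoding, errors='strict'):
    """Approximate what Python 2 does for unicode.decode() (assumption:
    the default encoding is ASCII)."""
    # Step 1: the implicit encode. It always uses strict error handling;
    # the caller's `errors` argument is NOT applied here.
    raw = text.encode('ascii')
    # Step 2: the decode the caller actually asked for.
    return raw.decode(encoding, errors)


try:
    py2_style_decode(u'caf\xe9', 'utf-8', errors='replace')
    outcome = 'decoded'
except UnicodeEncodeError:
    # Raised by step 1, before errors='replace' ever gets a chance.
    outcome = 'UnicodeEncodeError'
```

Pure-ASCII text passes through step 1 untouched, which is why the failure only shows up for some files.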
(Python 3 no longer does this, as it is terribly confusing.)
Check the type of message and, assuming it is indeed unicode, work back from there to find where it was decoded (possibly implicitly), then replace that with the correct decoding.
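While tracking down where the decoding happens, a defensive helper can make the function tolerate both kinds of input. This is only a sketch: to_text and text_type are hypothetical names, and the UTF-8-then-latin-1 order simply mirrors the question's code.

```python
text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3


def to_text(message):
    """Return message as Unicode text, decoding only if it is bytes."""
    if isinstance(message, text_type):
        # Already decoded somewhere upstream; decoding again is what
        # triggers the implicit ASCII encode on Python 2.
        return message
    try:
        return message.decode('utf-8')
    except UnicodeDecodeError:
        # latin-1 accepts every byte sequence, so no further fallback
        # is needed (the result may be mojibake, but it never raises).
        return message.decode('latin-1')
```

For example, to_text(b'caf\xc3\xa9') and to_text(b'caf\xe9') both return u'caf\xe9', and an existing unicode value is returned unchanged.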