Python2: Using .decode with errors='replace' still returns errors

Question

I have a message which is read from a file of unknown encoding, and I want to send it to a webpage for display. I've grappled a lot with UnicodeErrors, have gone through many Q&As on StackOverflow, and think I have a decent understanding of how Unicode and encodings work. My current code looks like this:

try:
    return message.decode(encoding='utf-8')
except:
    try:
        return message.decode(encoding='latin-1')
    except:
        # Last resort: replace undecodable bytes instead of raising.
        print("Unable to entirely decode in latin or utf-8, will replace error characters with '?'")
        return message.decode(encoding='utf-8', errors="replace")

The returned message is then dumped into JSON and sent to the front end.

I assumed that because I'm using errors="replace" in the last try/except, I would avoid exceptions at the expense of having a few '?' characters in my display. An acceptable cost.

However, it seems I was too hopeful: for some files I still get a UnicodeDecodeError saying "ascii codecs cannot decode" for some character. Why doesn't errors="replace" just take care of this?

(Also, as a bonus question: what does ASCII have to do with any of this? I'm specifying UTF-8.)

Recommended answer

You should not get a UnicodeDecodeError with errors='replace'. Also, str.decode('latin-1') should never fail, because ISO-8859-1 has a valid character mapping for every possible byte sequence.
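Both claims are easy to check on real byte strings. The snippet below is a minimal sketch, written to run unchanged on Python 2.7 and Python 3; the variable names are only illustrative. Note that on decoding, errors='replace' substitutes U+FFFD (the Unicode replacement character) rather than a literal '?'.

```python
# Every possible byte value, 0x00 through 0xFF, as one byte string.
all_bytes = bytes(bytearray(range(256)))

# Latin-1 maps each byte to the code point with the same number,
# so this decode cannot raise for any input.
latin_text = all_bytes.decode('latin-1')

# 0xFF and 0xFE are never valid in UTF-8; with errors='replace'
# each one becomes U+FFFD instead of raising UnicodeDecodeError.
bad_utf8 = b'\xff\xfeabc'
replaced = bad_utf8.decode('utf-8', errors='replace')
```

If either call raises in your program, the object being decoded is very likely not a byte string.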

My suspicion is that message is already a unicode string, not bytes. Unicode text has already been ‘decoded’ from bytes and can't be decoded any more.

When you call .decode() on a unicode string, Python 2 tries to be helpful and decides to encode the Unicode string back to bytes (using the default encoding), so that you have something that you can really decode. This implicit encoding step doesn't use errors='replace', so if there are any characters in the Unicode string that aren't in the default encoding (probably ASCII) you'll get a UnicodeEncodeError.
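The hidden step can be spelled out explicitly. The following is a rough sketch of what Python 2 effectively does in that situation, assuming the default encoding is ASCII; it is an emulation written with explicit calls (so it also runs on Python 3), not the interpreter's actual implementation.

```python
text = u'caf\xe9'  # already-decoded text containing a non-ASCII character

result = None
error_name = None
try:
    # Step 1 (implicit in Python 2): encode the text back to bytes with
    # the default codec. No errors= argument is applied here.
    raw = text.encode('ascii')
    # Step 2: the decode that was actually requested. errors='replace'
    # only applies to this call, which is never reached.
    result = raw.decode('utf-8', errors='replace')
except UnicodeError as exc:
    error_name = type(exc).__name__

print(error_name)
```

The failure comes from the ASCII encode in step 1, which is why an "ascii" codec shows up in the message even though UTF-8 was specified.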

(Python 3 no longer does this as it is terribly confusing.)

Check the type of message and, assuming it is indeed Unicode, work back from there to find where it was decoded (possibly implicitly), then replace that with the correct decoding.
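While tracking down where the earlier decoding happens, a type check can keep the function from decoding twice. This is a minimal sketch, not the asker's code: to_text is a hypothetical helper name, and the UTF-8 then Latin-1 order is taken from the question. It runs on Python 2.7 (where bytes is an alias for str) and on Python 3.

```python
def to_text(message):
    """Return text for message, decoding only when it is a byte string."""
    if not isinstance(message, bytes):
        # Already text (unicode in Python 2, str in Python 3):
        # there is nothing left to decode.
        return message
    try:
        return message.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return message.decode('latin-1')


print(type(to_text(b'caf\xc3\xa9')).__name__)
```

Because the Latin-1 fallback always succeeds, a third errors='replace' branch is never needed; whether Latin-1 is the right guess for the file is a separate question.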
