Bulletproof work with encoding in Python


Question

A question about Unicode in Python 2.

As far as I understand, I should always decode everything I read from outside (files, network). decode converts external bytes to internal Python strings using the charset specified in its argument. So decode("utf8") means that the outside bytes are a unicode string and they will be decoded to Python strings.

Also, I should always encode everything I write to the outside. I specify the encoding as the argument of the encode function, and it converts the text to the proper encoding for writing.

These statements are right, aren't they?

But sometimes when I parse HTML documents I get decode errors. As I understand it, the document is in another encoding (for example cp1252) and the error happens when I try to decode it as UTF-8. So the question is: how do I write a bulletproof application?

I found that there is a good library for guessing encodings, chardet, and that this is the only way to write bulletproof applications. Right?

Recommended answer


... decode("utf8") means that outside bytes are unicode string and they will be decoded to python strings.

...

These statements are right, aren't they?

No, outside bytes are binary data; they are not a unicode string. So <str>.decode("utf8") will produce a Python unicode object by interpreting the bytes in <str> as UTF-8; it raises UnicodeDecodeError if the bytes cannot be decoded as UTF-8.
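To make the distinction concrete, here is a minimal sketch with made-up byte values. It runs on both Python 2.7 and Python 3; on Python 3 the result of decode is a str rather than a unicode object. The variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
# Bytes as they might arrive from a file or a socket (illustrative values).
utf8_bytes = b"caf\xc3\xa9"    # the word "cafe" with an accented e, encoded as UTF-8
cp1252_bytes = b"caf\xe9"      # the same word encoded as cp1252

# Interpreting the bytes as UTF-8 yields text (unicode in Python 2, str in Python 3).
text = utf8_bytes.decode("utf-8")

try:
    cp1252_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # 0xE9 opens a multi-byte UTF-8 sequence, but no continuation bytes follow.
    decode_failed = True

# The same bytes decode cleanly once the right encoding is named.
fallback = cp1252_bytes.decode("cp1252")
```

The bytes themselves carry no label saying which encoding produced them, which is why the caller has to supply one.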

Determining the encoding of any given document is not necessarily a simple task. You either need some external source of information that tells you the encoding, or you need to know something about what is in the document. For example, if you know that it is an HTML document with its encoding specified internally, then you can parse the document using an algorithm like the one outlined in the HTML Standard to find the encoding and then use that encoding to parse the document (it's a two-pass operation). However, just because an HTML document specifies an encoding does not mean that it can be decoded with that encoding. You may still get errors if the data is corrupt or if the document was not encoded properly in the first place.
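The two-pass idea can be sketched roughly as follows. This is a deliberately simplified illustration, not the HTML Standard's algorithm: the regular expression only catches common meta declarations, the function names sniff_html_charset and decode_html are hypothetical, and the 1024-byte window and UTF-8 default are assumptions made for the example.

```python
import codecs
import re

# Simplified pattern; the HTML Standard's prescan handles many more cases.
_META_CHARSET = re.compile(
    br"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)

def sniff_html_charset(raw, default="utf-8"):
    """First pass: look for a charset declaration near the start of the bytes."""
    match = _META_CHARSET.search(raw[:1024])
    if not match:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # make sure Python knows this encoding name
    except LookupError:
        return default
    return name

def decode_html(raw):
    """Second pass: decode with the declared charset, replacing undecodable bytes."""
    charset = sniff_html_charset(raw)
    # "replace" substitutes U+FFFD instead of raising, since the declaration may be wrong.
    return raw.decode(charset, "replace")
```

Using the "replace" error handler is one possible design choice: it keeps the application from crashing on a mislabelled document at the cost of silently losing the bad bytes. Whether that trade-off is acceptable depends on the application.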

There are libraries such as chardet (I see you mentioned it already) that will try to guess the encoding of a document for you (it's only a guess, and not necessarily correct). But they have their own issues, such as performance, and they may not recognize the encoding of your document.
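One way to combine these sources of information is to try them in order of trustworthiness and treat the guess as a last resort. The sketch below assumes chardet is installed as a third-party package and uses its detect function, which returns a dict with 'encoding' and 'confidence' keys. The function name decode_with_guess, the candidate order and the 0.5 confidence threshold are assumptions chosen for illustration.

```python
try:
    import chardet  # third-party package; optional in this sketch
except ImportError:
    chardet = None

def decode_with_guess(raw, declared=None, min_confidence=0.5):
    """Try the declared encoding, then UTF-8, then a chardet guess.

    Returns a (text, encoding_used) pair. If every candidate fails, falls
    back to a lossy UTF-8 decode so the caller always gets text.
    """
    candidates = [declared, "utf-8"]
    if chardet is not None:
        guess = chardet.detect(raw)  # a guess only, not necessarily correct
        if guess["encoding"] and guess["confidence"] >= min_confidence:
            candidates.append(guess["encoding"])
    for name in candidates:
        if not name:
            continue
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding; try the next candidate
    return raw.decode("utf-8", "replace"), "utf-8 (lossy)"
```

Note that a successful decode does not prove the encoding was right: single-byte encodings such as cp1252 accept almost any byte sequence, so a wrong declaration can produce garbled text without raising an error.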
