Python: Convert Unicode to ASCII without errors

Question

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

So I assume that means the HTML contains a malformed attempt at Unicode somewhere. Can I just drop whatever bytes are causing the problem instead of getting an error?

Recommended answer

This error usually appears when you call .encode() on a byte string that is already encoded: Python 2 first decodes the byte string implicitly with the ASCII codec, and that implicit step is what raises the UnicodeDecodeError. So try decoding it first, as in:

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

For example:

html = '\xa0'
encoded_str = html.encode("utf8")

fails with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

While:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

This succeeds without error. Note that "windows-1252" is only an example. I got it from chardet, which reported just 0.5 confidence that it is right (not surprising for a one-character string). Replace it with the encoding that actually applies to the byte string returned by .urlopen().read() for the content you retrieved.
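The decode-then-encode order described above can be sketched as follows. This is a minimal illustration that assumes Python 3 (where byte strings have no .encode() method at all, so the mistake surfaces even earlier than in Python 2); the sample bytes are hypothetical, and "windows-1252" is again only an example encoding.

```python
# Python 3 sketch: decode bytes to text first, then encode the text.
raw = b"caf\xe9\xa0au lait"            # hypothetical bytes, assumed to be windows-1252

text = raw.decode("windows-1252")      # bytes -> str
utf8_bytes = text.encode("utf-8")      # str -> UTF-8 bytes

# To drop everything that has no ASCII equivalent instead of raising,
# pass an error handler to encode():
ascii_bytes = text.encode("ascii", "ignore")
```

The "ignore" handler silently discards characters, so the output can lose information; "replace" substitutes a question mark instead, which at least leaves a visible trace.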

Another problem is that the .encode() string method returns the encoded string; it does not modify the source in place. So calling self.response.out.write(html) afterwards is pointless, because html is still the original string, not the result of html.encode() (if that is what you were aiming for).
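A small sketch of that point, again assuming Python 3 and a made-up sample string: strings are immutable, so the return value of .encode() has to be kept.

```python
text = "caf\u00e9"

text.encode("utf-8")             # result is discarded; text is unchanged
encoded = text.encode("utf-8")   # keep the returned bytes instead

still_text = text                # the original str object is untouched
```

In the question's handler, the equivalent fix would be to write the value returned by the encode call rather than the original variable.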

As Ignacio suggested, check the source web page for the actual encoding of the string returned by read(). It is declared either in one of the meta tags or in the Content-Type header of the response. Pass that as the argument to .decode().
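One way to read those two declarations is sketched below, assuming Python 3 and only the standard library. The helper names charset_from_header and charset_from_meta are hypothetical, and the regular expression is a rough heuristic rather than a full HTML parser.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(raw_html):
    """Look for a charset declaration in a <meta> tag near the top of the page."""
    head = raw_html[:2048]  # declarations are expected early in the document
    match = re.search(rb"<meta[^>]+charset=[\"']?([\w-]+)", head, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None
```

For instance, charset_from_header("text/html; charset=ISO-8859-1") gives "iso-8859-1", and either helper returns None when no declaration is present.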

Note, however, that you should not assume other developers are careful enough to make the header and/or meta charset declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before.)
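Because a declared charset can be wrong, a defensive approach is to treat it as a first guess and fall back to other encodings. The sketch below assumes Python 3; decode_html is a hypothetical helper, and the fallback order is an arbitrary choice for illustration, not a recommendation from the answer.

```python
def decode_html(raw, declared=None, fallbacks=("utf-8", "windows-1252")):
    """Decode raw bytes, trying the declared charset first, then fallbacks.

    Returns a (text, encoding_used) tuple.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong declaration or unknown codec name: try the next candidate.
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"
```

A successful decode does not prove the guess was right, since several single-byte encodings accept almost any input; it only guarantees that no exception is raised.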
