Convert Unicode to ASCII without errors in Python


Question

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?

Answer

2018 Update:

As of February 2018, compression such as gzip is quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
If you do a simple decode like in the original answer with a gzipped response, you'll get an error similar to this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

In order to decode a gzipped response you need to add the following modules (in Python 3):

import gzip
import io
from urllib.request import urlopen

Note: In Python 2 you'd use StringIO instead of io

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-resource")
buffer = io.BytesIO(response.read())  # use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8")  # replace utf-8 with the source encoding of the requested resource

This code reads the response and places the bytes in an in-memory buffer. The gzip module then reads the buffer through the GzipFile class. After that, the unzipped data can be read as bytes and, in the end, decoded into normally readable text.
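
As a side note, the intermediate buffer isn't strictly needed: since Python 3.2, gzip.decompress works directly on raw bytes. A minimal sketch, with the compressed input fabricated in place of a real HTTP response:

```python
import gzip

# Stand-in for response.read(): some UTF-8 text, gzip-compressed.
raw = gzip.compress("café".encode("utf-8"))

# One call replaces the io.BytesIO + GzipFile combination above.
content = gzip.decompress(raw).decode("utf-8")
print(content)  # café
```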

Can we get the actual value used for link?

In addition, we usually encounter this problem here when trying to .encode() an already encoded byte string. So you might try to decode it first, as in:

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")
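
A self-contained sketch of that round trip, with windows-1252 standing in for whatever encoding the page actually uses:

```python
# Stand-in for urllib.urlopen(link).read(): bytes containing 0xe9 (é)
# and 0xa0 (non-breaking space) as encoded in windows-1252.
html = b"caf\xe9\xa0menu"

unicode_str = html.decode("windows-1252")  # bytes -> str
encoded_str = unicode_str.encode("utf8")   # str -> UTF-8 bytes

print(encoded_str)  # b'caf\xc3\xa9\xc2\xa0menu'
```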

For example:

html = '\xa0'
encoded_str = html.encode("utf8")

fails with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

while:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

succeeds without error. Do note that "windows-1252" is just something I used as an example. I got it from chardet, which reported 0.5 confidence that it is right (well, with a 1-character string, what do you expect?). You should change that to the actual encoding of the byte string returned by .urlopen().read(), whatever applies to the content you retrieved.
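
If chardet isn't available, a crude stdlib-only fallback is to try a short list of likely encodings until one decodes without error. The candidate list below is just an illustration, not a recommendation:

```python
def guess_decode(data, candidates=("utf-8", "windows-1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    latin-1 maps every byte value, so as a last resort it never fails;
    it just may give you the wrong characters.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue

text, used = guess_decode(b"\xa0")
print(used)  # utf-8 rejects a lone 0xa0 byte, so windows-1252 is chosen
```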

Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So having self.response.out.write(html) is kind of useless, because html is not the encoded string from html.encode (if that is what you were originally aiming for).

As Ignacio suggested, check the source web page for the actual encoding of the string returned by read(). It's either in one of the meta tags or in the Content-Type header of the response. Then use that as the parameter for .decode().
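
The charset parameter of a Content-Type header can be pulled out with the stdlib email machinery, which already understands the header's parameter syntax. The header value below is a made-up example:

```python
from email.message import Message

msg = Message()
msg["Content-Type"] = "text/html; charset=ISO-8859-1"

# get_content_charset returns the charset parameter, lower-cased,
# or None if the header carries no charset at all.
charset = msg.get_content_charset()
print(charset)  # iso-8859-1
```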

Do note, however, that you should not assume other developers were responsible enough to make sure the header and/or meta character-set declarations match the actual content. (Which is a PITA; yeah, I should know, I used to be one of those.)
