Convert Unicode to ASCII without errors in Python

Problem description

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?

Answer

2018 Update:

As of February 2018, compression such as gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and the Stack Exchange Network sites).
If you do a simple decode like in the original answer on a gzipped response, you'll get an error similar to this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

In order to decode a gzipped response you need to add the following modules (in Python 3):

import gzip
import io

Note: In Python 2 you would use StringIO instead of io.

Then you can parse the content out like this:

from urllib.request import urlopen  # in Python 3, urlopen lives in urllib.request
response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer via the GzipFile class. After that, the gzipped file can be read into bytes again and finally decoded to normally readable text.

Can we get the actual value used for link?

In addition, we usually run into this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first, as in:

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

Whereas:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

succeeds without error. Do note that "windows-1252" is just something I used as an example; I got it from chardet, which had 0.5 confidence that it is right (well, with a string of length 1, what do you expect?). You should change it to whatever encoding actually applies to the byte string returned from .urlopen().read(), i.e. to the content you retrieved.
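
If the page does not declare its encoding anywhere, a detector such as chardet can guess it from the raw bytes, as mentioned above. A minimal sketch of that approach (Python 2 style, matching the question; chardet is a third-party package, link is the URL variable from the question, and the confidence value tells you how much to trust the guess):

import urllib
import chardet
raw = urllib.urlopen(link).read()  # raw byte string from the page, as in the question
guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1252', 'confidence': 0.5}
unicode_str = raw.decode(guess["encoding"] or "utf-8", "replace")  # fall back to utf-8 if detection fails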

Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it is somewhat pointless to call self.response.out.write(html), because html is not the encoded string from html.encode (if that is what you were originally aiming for).
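
In other words, the encoded result has to be captured and passed on explicitly. A minimal sketch in the spirit of the original handler, assuming the page is (mostly) UTF-8; the "ignore" error handler simply drops the bytes that do not decode, which is what the question asked for:

html = urllib.urlopen(link).read()
# decode()/encode() return new strings, so reassign the result instead of discarding it
html = html.decode("utf-8", "ignore").encode("utf-8")
self.response.out.write(html)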

As Ignacio suggested, check the source webpage for the actual encoding of the string returned from read(). It's either in one of the meta tags or in the Content-Type header of the response. Then use that as the parameter for .decode().
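
A rough sketch of both lookups in Python 3's urllib.request (the meta-tag regex is intentionally simplistic and only meant as an illustration, and link is again the URL from the question):

import re
from urllib.request import urlopen
response = urlopen(link)
raw = response.read()
# 1) charset from the Content-Type response header, if the server sent one
charset = response.headers.get_content_charset()
# 2) otherwise look for a charset declaration in the HTML itself
if charset is None:
    match = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw)
    charset = match.group(1).decode("ascii") if match else "utf-8"
text = raw.decode(charset, "replace")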

Do note, however, that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character-set declarations match the actual content. (Which is a PITA, yeah, I should know; I was one of those before.)
