Chinese Unicode issue?
Question
From this website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31
<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>
I'm scraping the page and trying to extract the text 百度汇总,
but when I set r.encoding = 'utf-8'
the result is �ٶȻ���
If I don't set the encoding to utf-8
the result is °Ù¶È»ã×Ü
Solution
The server doesn't tell you anything helpful in the response headers, but the HTML page itself contains:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
GB2312 is a variable-width encoding, like UTF-8. The page lies, however: it actually uses GBK, an extension of GB2312.
You can decode it with GBK just fine:
>>> len(r.content.decode('gbk'))
44535
>>> u'百度汇总' in r.content.decode('gbk')
True
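Both garbled strings from the question can be reproduced offline, which helps confirm the diagnosis. This is a minimal sketch, assuming the page bytes for 百度汇总 are GBK-encoded and that the result seen without utf-8 came from a Latin-1 (ISO-8859-1) decode, which the second garbled string is consistent with:

```python
# The four characters as they would appear on the wire if the page is GBK.
raw = '百度汇总'.encode('gbk')

# Forcing UTF-8: most of these bytes are not valid UTF-8 and become U+FFFD,
# while two byte pairs happen to form unrelated but valid characters.
as_utf8 = raw.decode('utf-8', errors='replace')

# Latin-1 never fails: every byte maps to exactly one character.
as_latin1 = raw.decode('latin-1')

# The matching codec recovers the original text.
as_gbk = raw.decode('gbk')

print(repr(as_utf8))
print(as_latin1)
print(as_gbk)
```

Neither wrong decode loses the underlying bytes in r.content, so switching the codec is enough to fix the output.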
Decoding with gb2312 fails:
>>> r.content.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 26367-26368: illegal multibyte sequence
but since GBK is a superset of GB2312, it should always be safe to use the former even when the latter is specified.
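The superset relationship can be checked with Python's built-in codecs. The sketch below uses the traditional character 龍 as an illustrative example of a character that GBK covers and GB2312 does not; the choice of character is an assumption for demonstration and does not come from the page in question:

```python
# Illustrative character: traditional form, outside the GB2312 repertoire.
ch = '龍'

# GBK can encode it.
gbk_bytes = ch.encode('gbk')

# GB2312 cannot, so encoding raises UnicodeEncodeError.
try:
    ch.encode('gb2312')
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

# Text that stays within GB2312 produces identical bytes under both codecs,
# which is why decoding GB2312 data with the GBK codec is safe.
same_bytes = '百度汇总'.encode('gb2312') == '百度汇总'.encode('gbk')

print(in_gb2312, same_bytes)
```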
If you are using requests, then setting r.encoding to gb2312 works because r.text uses replace when handling decode errors:
content = str(self.content, encoding, errors='replace')
so the decoding errors raised under GB2312 are masked: bytes for codepoints defined only in GBK are replaced rather than reported.
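The masking effect can be seen without a network request by applying the same call shape as the line quoted above to a hand-made byte string. This is a sketch only: the sample text and the GBK-only character 龍 are assumptions for illustration, not content from the page.

```python
# Sample bytes standing in for r.content: GB2312-safe text plus one
# character that exists only in GBK.
raw = '百度汇总 龍'.encode('gbk')

# Declared charset with errors='replace': no exception, but the bytes the
# gb2312 codec cannot handle turn into U+FFFD.
masked = str(raw, 'gb2312', errors='replace')

# Actual charset: everything decodes, nothing is replaced.
exact = str(raw, 'gbk', errors='replace')

print('\ufffd' in masked)  # data was silently lost
print('\ufffd' in exact)   # no loss
```

With requests, the equivalent fix is to assign r.encoding = 'gbk' before reading r.text.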
Note that BeautifulSoup can do the decoding all by itself; it will find the charset declared in the meta tag:
>>> soup = BeautifulSoup(r.content)
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
The warning is caused by the GBK codepoints being used while the page claims to use GB2312.
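To avoid the replacement characters, the declared charset can be read from the meta tag by hand and upgraded from GB2312 to GBK before decoding. The sketch below is a hand-rolled approximation using only the standard library; decode_page and META_CHARSET are hypothetical names, and this is not how BeautifulSoup or requests implement detection:

```python
import re

# Hypothetical helper pattern: grabs the charset value from a meta tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_page(raw: bytes, fallback: str = 'utf-8') -> str:
    """Decode HTML bytes using the meta charset, treating gb2312 as gbk."""
    match = META_CHARSET.search(raw[:4096])
    declared = match.group(1).decode('ascii').lower() if match else fallback
    # GBK is a superset of GB2312, so the upgrade is safe and also covers
    # pages that declare gb2312 but contain GBK-only characters.
    codec = 'gbk' if declared == 'gb2312' else declared
    # An unknown codec name would raise LookupError here.
    return raw.decode(codec, errors='replace')

# Sample input modelled on the page: declares gb2312, contains a GBK-only
# character (龍 is an illustrative assumption).
html = ('<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
        '<td id="e_9" class="qh_one">百度汇总 龍</td>').encode('gbk')
text = decode_page(html)
print('\ufffd' in text)
```

With BeautifulSoup itself, passing from_encoding='gbk' to the constructor overrides the declared charset in a similar way.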