Chinese Unicode issue?


Problem Description

From this website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31

<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>

I'm scraping the text and trying to get 百度汇总

but when I set r.encoding = 'utf-8' the result is �ٶȻ���

if I don't use utf-8 the result is °Ù¶È»ã×Ü
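For reference, a minimal sketch of what the question is doing, assuming the requests library and the URL above (the variable names are illustrative):

import requests

url = 'http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31'
r = requests.get(url)

# Without touching r.encoding, requests likely falls back to ISO-8859-1
# (the response headers carry no charset), giving °Ù¶È»ã×Ü-style text.
print(r.text[:200])

# Forcing UTF-8 instead produces replacement characters such as �ٶȻ���.
r.encoding = 'utf-8'
print(r.text[:200])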

Solution

The server doesn't tell you anything helpful in the response headers, but the HTML page itself contains:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
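You can confirm this yourself; a quick check, assuming r is the requests response for this page:

# The headers carry no charset parameter; only the body's <meta> tag declares one.
print(r.headers.get('Content-Type'))     # e.g. 'text/html' with no charset
print(b'charset=gb2312' in r.content)    # True: the declaration is inside the HTML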

GB2312 is a variable-width encoding, like UTF-8. The page lies, however; it actually uses GBK, an extension of GB2312.

You can decode it with GBK just fine:

>>> len(r.content.decode('gbk'))
44535
>>> u'百度汇总' in r.content.decode('gbk')
True

Decoding with gb2312 fails:

>>> r.content.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 26367-26368: illegal multibyte sequence

but since GBK is a superset of GB2312, it should always be safe to use the former even when the latter is specified.

If you are using requests, then setting r.encoding to gb2312 still works, because r.text decodes with errors='replace':

content = str(self.content, encoding, errors='replace')

so the decoding errors for codepoints that are only defined in GBK are masked when GB2312 is used.
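Rather than relying on that replacement behaviour, you can point requests at GBK yourself; a short sketch, again assuming r is the same response object:

# GBK is a superset of GB2312, so this decodes every codepoint on the page.
r.encoding = 'gbk'
text = r.text
print(u'百度汇总' in text)    # True, with no replacement characters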

Note that BeautifulSoup can do the decoding all by itself; it'll find the meta header:

>>> soup = BeautifulSoup(r.content)
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

The warning is caused by the GBK codepoints being used while the page claims to use GB2312.
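If you want to avoid the warning, you can hand BeautifulSoup the encoding explicitly instead of letting it trust the meta tag; a sketch, assuming bs4 with the html.parser backend:

from bs4 import BeautifulSoup

# from_encoding overrides the page's (incorrect) gb2312 declaration.
soup = BeautifulSoup(r.content, 'html.parser', from_encoding='gbk')
cell = soup.find('td', id='e_9')
print(cell.get_text())        # 百度汇总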
