Encoding issue of a character in utf-8

Problem description

I get a link from a web page using the Beautiful Soup library, via a.get('href'). The link contains the character ®, but when I retrieve it, it becomes Â®. How can I encode it properly? I have already added # -*- coding: utf-8 -*- at the top of the file.

r = requests.get(url)
soup = BeautifulSoup(r.text)

Solution

Do not use r.text; leave decoding to BeautifulSoup:

soup = BeautifulSoup(r.content)

r.content gives you the response as bytes, without decoding. r.text, on the other hand, is the response already decoded to Unicode.

What happens is that the server did not include the character set in the response headers. In that case, requests follows HTTP RFC 2616, section 3.7.1: text/* responses are expected to use the ISO-8859-1 (Latin-1) character set by default.
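The fallback described above can be illustrated with a small standard-library sketch. The function name charset_from_content_type is hypothetical, and this is a simplified stand-in for the behaviour, not the actual code inside requests:

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset implied by a Content-Type header value, or None.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    declared = msg.get_content_charset()  # lower-cased charset parameter, or None
    if declared:
        return declared
    if msg.get_content_maintype() == 'text':
        # A text/* response without a charset parameter: assume Latin-1.
        return 'ISO-8859-1'
    return None


print(charset_from_content_type('text/html; charset=UTF-8'))
print(charset_from_content_type('text/html'))
print(charset_from_content_type('application/json'))
```

The second call is the situation from the question: the header names no charset, so a Latin-1 default is applied even though the page bytes are actually UTF-8.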

For your HTML page, that default is wrong, so you got incorrect results: r.text decoded the UTF-8 bytes as Latin-1, producing mojibake:

>>> print('®'.encode('utf8').decode('latin1'))
Â®
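As a rough illustration of why this happens, the sketch below (Python 3, standard library only) walks through the same round trip step by step, and also shows that this particular kind of damage can be undone by reversing the wrong decode. The variable names are only for illustration:

```python
original = '®'

raw = original.encode('utf-8')       # the two bytes the server actually sent
garbled = raw.decode('latin-1')      # each byte read as its own Latin-1 character
repaired = garbled.encode('latin-1').decode('utf-8')  # undo the wrong decode

print(repr(raw))
print(garbled)
print(repaired)
```

Repairing after the fact only works when the wrong decode was lossless, as Latin-1 is. Passing the raw bytes to the parser avoids the problem in the first place.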

An HTML page can declare its own encoding, in the form of a <meta> tag in the HTML <head>. When given bytes, BeautifulSoup will use that declaration and decode the bytes for you.

Even if the <meta> header tag is missing, BeautifulSoup includes other methods to auto-detect encodings.
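To make the difference concrete without a network request, the sketch below feeds BeautifulSoup a hand-written UTF-8 page, once as bytes (what r.content would hold) and once as text wrongly decoded as Latin-1 (what r.text held in the question). The sample HTML and the /product® link are invented for illustration, and the bs4 package is assumed to be installed:

```python
from bs4 import BeautifulSoup

# Stand-in for r.content: UTF-8 bytes with the encoding declared in a <meta> tag.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><a href="/product®">link</a></body></html>'
).encode('utf-8')

# Bytes in: BeautifulSoup works out the encoding itself.
soup = BeautifulSoup(html_bytes, 'html.parser')
good_href = soup.a.get('href')

# Stand-in for r.text in the question: already decoded with the wrong codec,
# so BeautifulSoup has no chance to correct it.
bad_text = html_bytes.decode('latin-1')
bad_href = BeautifulSoup(bad_text, 'html.parser').a.get('href')

print(good_href)
print(bad_href)
print(soup.original_encoding)  # the encoding BeautifulSoup settled on
```

The explicit 'html.parser' argument is a choice made here to avoid depending on an optional parser; the answer's own call omits it.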
