Encoding issue of a character in UTF-8
I get a link from a web page using the Beautiful Soup library, through a.get('href'). The link contains a strange character, ®, but when I retrieve it, it becomes Â®. How can I encode it properly? I have already added # -*- coding: utf-8 -*- at the beginning of the file.
r = requests.get(url)
soup = BeautifulSoup(r.text)
Do not use r.text; leave decoding to BeautifulSoup:
soup = BeautifulSoup(r.content)
r.content gives you the response in bytes, without decoding. r.text, on the other hand, is the response decoded to unicode.
What happens is that the server did not include the character set in the response headers. In that case, requests follows HTTP RFC 2616, section 3.7.1: text/* responses are by default expected to use the ISO-8859-1 (Latin 1) character set.
For your HTML page, that default is wrong, and you got incorrect results; r.text decoded the bytes as Latin-1, resulting in mojibake:
>>> print u'®'.encode('utf8').decode('latin1')
Â®
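The same effect can be reproduced without any network access. The following is only a rough Python 3 sketch of the fallback described above, not the actual requests implementation; the helper name guess_text_encoding and the sample body are made up for illustration.

```python
from email.message import Message


def guess_text_encoding(content_type):
    """Hypothetical helper: pick a charset from a Content-Type header value.

    Simplified illustration of the RFC 2616 rule: use the declared charset
    if there is one, otherwise assume ISO-8859-1 for text/* responses.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None when no charset parameter
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


# The bytes as a server would send them: the page is really UTF-8.
body = "brand®".encode("utf-8")

# No charset in the header: the Latin-1 default produces mojibake.
wrong = body.decode(guess_text_encoding("text/html"))

# Charset declared in the header: the text decodes correctly.
right = body.decode(guess_text_encoding("text/html; charset=utf-8"))

# Already-garbled text can be repaired by reversing the wrong decode.
repaired = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong), repr(right), repr(repaired))
```

Reversing the decode only works because Latin-1 maps every byte to exactly one character, so no information is lost in the wrong step. Passing the raw bytes to the parser avoids the problem in the first place.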
An HTML page can itself declare the correct encoding, in the form of a <meta> tag in the HTML header. BeautifulSoup will use that tag and decode the bytes for you.
Even if the <meta> header tag is missing, BeautifulSoup includes other methods to auto-detect encodings.
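As a small illustration, here is a sketch that hands undecoded bytes to BeautifulSoup, assuming Python 3 with the bs4 package installed. The HTML document and the example.com URL are invented for the example, and the bytes stand in for what r.content would return.

```python
from bs4 import BeautifulSoup

# Made-up page that declares its encoding in a <meta> tag.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><a href="http://example.com/brand®">link</a></body></html>'
).encode("utf-8")

# Pass bytes, not text, so BeautifulSoup does the decoding itself.
soup = BeautifulSoup(html_bytes, "html.parser")

href = soup.a.get("href")
print(href)

# The encoding BeautifulSoup settled on while decoding the bytes.
print(soup.original_encoding)
```

With real responses, the equivalent call would be BeautifulSoup(r.content, "html.parser"); naming the parser explicitly also avoids the warning that recent bs4 versions print when none is given.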