Requests module encoding provides a different encoding than the HTML encoding
Problem description
The request module
encoding attribute reports a different encoding than the one actually set in the HTML page.

Code:
    import requests

    URL = "http://www.reynamining.com/nuevositio/contacto.html"
    obj = requests.get(URL, timeout=60, verify=False, allow_redirects=True)
    print(obj.encoding)

Output:
ISO-8859-1
Whereas the actual encoding set in the HTML is UTF-8:

    content="text/html; charset=UTF-8"

My questions are:
- Why is the encoding reported by requests different from the encoding declared in the HTML page?
- I am trying to convert the content to UTF-8 using objReq.content.decode(encodes).encode("utf-8"). Because the content is already UTF-8, decoding it as ISO-8859-1 and then encoding it as UTF-8 changes the values; for example, á changes to Ã. Is there any way to convert all types of encodings into UTF-8?

Solution
Requests sets the response.encoding attribute to ISO-8859-1 when you have a text/* response and no charset has been specified in the Content-Type response header.

See the Encoding section of the Advanced documentation:

"The only time Requests will not do this [guess the encoding from the response body] is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content."

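To see why the ISO-8859-1 fallback corrupts the text, here is a minimal sketch that works on plain bytes, with no network access. It assumes the server really sent UTF-8 bytes and reuses the sample character from the question:

```python
# The page is really UTF-8, so "á" arrives as two bytes.
raw = "á".encode("utf-8")
assert raw == b"\xc3\xa1"

# Decoding with the ISO-8859-1 fallback turns each byte into its own character.
wrong = raw.decode("iso-8859-1")  # 'Ã¡' (mojibake)

# Re-encoding that text as UTF-8 bakes the damage into the output bytes.
double_encoded = wrong.encode("utf-8")

# Decoding the raw bytes with the correct codec gives the intended text.
right = raw.decode("utf-8")  # 'á'

# Mojibake produced this way is reversible, because ISO-8859-1 maps every byte.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(repr(wrong), repr(right), repr(repaired))
```

With a real response, the equivalent of the correct decode is to set resp.encoding = 'utf-8' before reading resp.text, or to call resp.content.decode('utf-8') yourself.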
You can test for this by looking for a charset parameter in the Content-Type header:

    resp = requests.get(....)
    encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None

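The substring test can also be replaced by actually parsing the header value. The following is only a sketch using the standard library's email.message.Message; the helper name header_charset is hypothetical and not part of Requests:

```python
from email.message import Message


def header_charset(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


print(header_charset("text/html"))                 # None
print(header_charset("text/html; charset=UTF-8"))  # utf-8
```

It would be used the same way as the one-liner, for example: encoding = resp.encoding if header_charset(resp.headers.get('content-type')) else None.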
Your HTML document specifies the content type in a <meta> header and, because the HTTP headers carry no charset, it is this header that is authoritative:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

HTML 5 also defines a <meta charset="..." /> tag; see "<meta charset="utf-8"> vs <meta http-equiv="Content-Type">".

You should not recode HTML pages to UTF-8 if they contain such a header with a different codec. You must at the very least correct that header in that case.
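
Both <meta> forms can be read from the raw bytes before any decoding decision is made. This is a rough sketch with the standard library's html.parser; the names MetaCharsetParser and sniff_meta_charset are hypothetical, and the 1024-byte limit assumes the declaration sits near the top of the document, as HTML 5 requires:

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML 5 form: <meta charset="...">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="...; charset=...">
            for part in (attrs.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'").lower()


def sniff_meta_charset(raw, limit=1024):
    """Return the charset declared in the first `limit` bytes, or None."""
    parser = MetaCharsetParser()
    # Tag and attribute names are ASCII, so a lossy ASCII decode is enough here.
    parser.feed(raw[:limit].decode("ascii", errors="replace"))
    return parser.charset


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head>'
print(sniff_meta_charset(page))
```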
Using BeautifulSoup:

    from bs4 import BeautifulSoup

    # pass in explicit encoding if set as a header
    encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
    content = resp.content
    soup = BeautifulSoup(content, from_encoding=encoding)
    if soup.original_encoding != 'utf-8':
        meta = soup.select_one('meta[charset], meta[http-equiv="Content-Type"]')
        if meta:
            # replace the meta charset info before re-encoding
            if 'charset' in meta.attrs:
                meta['charset'] = 'utf-8'
            else:
                meta['content'] = 'text/html; charset=utf-8'
        # re-encode to UTF-8; passing an encoding makes prettify() return bytes
        content = soup.prettify(encoding='utf-8')

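
If BeautifulSoup is not available, the same order of preference can be sketched on plain bytes: a charset from the HTTP header first, then one declared in the document, then a UTF-8 attempt, and ISO-8859-1 only as the last resort. The function names and parameters below are hypothetical, and the caller is assumed to have obtained the two declared charsets already:

```python
def decode_html(raw, header_charset=None, meta_charset=None):
    """Decode raw bytes, returning (text, codec_name_used)."""
    for candidate in (header_charset, meta_charset, "utf-8"):
        if not candidate:
            continue
        try:
            return raw.decode(candidate), candidate
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that do not fit it: try the next one.
            continue
    # ISO-8859-1 accepts any byte sequence, so this never raises.
    return raw.decode("iso-8859-1"), "iso-8859-1"


def to_utf8(raw, header_charset=None, meta_charset=None):
    """Return the document re-encoded as UTF-8 bytes."""
    text, _ = decode_html(raw, header_charset, meta_charset)
    return text.encode("utf-8")


print(decode_html("café".encode("utf-8")))
print(decode_html("café".encode("iso-8859-1")))
```

This only converts the bytes. Any <meta> charset declaration inside the document still has to be corrected separately, as described above.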
Similarly, other document standards may also specify specific encodings; XML, for example, is always UTF-8 unless specified otherwise by a <?xml encoding="..." ... ?> XML declaration, which is again part of the document.
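
For XML, the declared encoding can be read from the raw bytes in a similar way. This is only a sketch: the helper name xml_declared_encoding is hypothetical, the regular expression is a simplification of the XML grammar, and it only works for ASCII-compatible encodings (not, for example, UTF-16):

```python
import re

# Matches the encoding pseudo-attribute inside a leading <?xml ... ?> declaration.
_XML_DECL = re.compile(
    rb'<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)


def xml_declared_encoding(raw, default="utf-8"):
    """Return the encoding named in the XML declaration, or `default`."""
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]  # skip a UTF-8 byte order mark
    match = _XML_DECL.match(raw)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(xml_declared_encoding(b'<?xml version="1.0"?><a/>'))
```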