How does requests determine the encoding of a response?
Question
How can a response's apparent_encoding attribute be incorrect?
The code snippet below demonstrates my question:
import requests
url = "https://item.jd.com/100000177760.html"
r = requests.get(url)
print(r.status_code, r.encoding) # 200, gbk (correct)
print(r.apparent_encoding) # GB2312 (wrong)
How does requests determine the response's character encoding?
Recommended answer
Requests extracts the encoding from the charset parameter of the response's Content-Type header. If the header has no charset and the content type is a "text" type, ISO-8859-1 (latin-1) is assumed. Otherwise r.encoding is left as None, and the response's apparent_encoding property is evaluated and used in its place when r.text decodes the body.
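The header rule described above can be sketched with the standard library alone. This is a minimal illustration, not the code requests actually runs: the function name encoding_from_content_type is hypothetical, and the parsing is delegated to email.message.Message for convenience.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Hypothetical helper that mimics the header rule described above."""
    if not content_type:
        return None

    # Let the standard library parse the header and its parameters.
    msg = Message()
    msg["Content-Type"] = content_type

    # 1. An explicit charset parameter wins.
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")

    # 2. No charset, but a text/* type: fall back to ISO-8859-1.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"

    # 3. Nothing declared: the caller has to guess from the body.
    return None


print(encoding_from_content_type("text/html; charset=gbk"))
print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("image/png"))
```

With the header from the question, the first branch applies, so no guessing from the body is needed.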
apparent_encoding is determined by using the chardet library to guess the encoding of the response body (recent releases of requests can use charset_normalizer for the same purpose).
In the case of the URL in the question, the encoding is declared in the Content-Type header:
>>> r.headers['Content-Type']
'text/html; charset=gbk'
so r.apparent_encoding is not evaluated until it is explicitly accessed by executing print(r.apparent_encoding).
In this particular case, chardet seems to get it wrong: the response's text attribute can be encoded with the gbk codec, but not with GB2312. GBK is a superset of GB2312, so the page presumably contains at least one character that exists only in GBK.
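The same kind of check can be reproduced offline. The sketch below uses a hypothetical helper, encodable, and an assumed sample character (the traditional form "龍", which is not taken from the page in the question) to show text that the gbk codec accepts and the gb2312 codec rejects.

```python
def encodable(text, codec):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Assumed sample: GB2312 covers simplified characters only,
# while GBK also includes traditional forms such as this one.
sample = "龍"

print(encodable(sample, "gbk"))
print(encodable(sample, "gb2312"))
```

Running the same helper on r.text from the question would be one way to confirm which of the two codecs can represent the page.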