How does requests determine the encoding of a response?


Question

How can a response's apparent_encoding attribute be incorrect?

The code snippet below demonstrates my question:

import requests

url = "https://item.jd.com/100000177760.html"
r = requests.get(url)

print(r.status_code, r.encoding)  # 200 gbk (correct)
print(r.apparent_encoding)  # GB2312 (wrong)

How does requests determine the response's character encoding?

Recommended answer

Requests extracts the encoding from the charset parameter of the response's Content-Type header. If the header contains no charset and the content type is a "text" type, ISO-8859-1 (latin-1) is assumed. Otherwise r.encoding is left as None, and the response's apparent_encoding property is evaluated and used in its place when r.text is accessed.
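The header rule can be sketched with the standard library alone. This is only an approximation for illustration: encoding_from_content_type is a hypothetical helper, not part of requests, and the library's own logic (exposed as requests.utils.get_encoding_from_headers) may handle more cases than shown here. A return value of None stands for "no encoding from the header, fall back to guessing from the body".

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Hypothetical helper: approximate the Content-Type rule described above."""
    if not content_type:
        return None

    # Reuse the stdlib header parser to split the media type from its parameters.
    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_param("charset")
    if charset:
        # An explicit charset parameter always wins.
        return charset

    if msg.get_content_maintype() == "text":
        # text/* without a charset: the ISO-8859-1 default.
        return "ISO-8859-1"

    # Nothing usable in the header; the caller would guess from the body.
    return None


print(encoding_from_content_type("text/html; charset=gbk"))
print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("application/octet-stream"))
```

For the three headers above, the helper should report the declared charset, the latin-1 default, and None respectively.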

apparent_encoding is determined by using the chardet library to guess the encoding of the response body (recent releases of requests use charset_normalizer for this unless chardet is installed).

In the case of the URL in the question, the encoding is declared in the Content-Type header:

>>> r.headers['Content-Type']
'text/html; charset=gbk'

so r.apparent_encoding is not evaluated until it is explicitly accessed by executing print(r.apparent_encoding).

In this particular case, chardet seems to get it wrong: the response's text attribute can be encoded with the gbk codec, but not with GB2312. GBK is a superset of GB2312, so the page presumably contains characters that exist only in GBK.
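The difference between the two codecs can be checked offline with Python's built-in codecs. The sample string is an assumption chosen for illustration (it is not taken from the page in the question): it mixes simplified characters, which GB2312 covers, with the traditional character 龍, which is covered by GBK only. The helper name encodable is likewise hypothetical.

```python
def encodable(text, codec):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Assumed sample: "京东" is in GB2312, the traditional character "龍" is not.
sample = "京东龍"

print(encodable(sample, "gbk"))
print(encodable(sample, "gb2312"))
```

Running the same check against r.text from the question would be one way to confirm which of the two encodings the page actually needs.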
