Requests module encoding differs from the encoding declared in the HTML

Views: 108 Published: 2020/10/29 6:13:44 python encoding python-requests


Problem description


The encoding reported by the requests module differs from the encoding actually declared in the HTML page.

Code:

import requests
URL = "http://www.reynamining.com/nuevositio/contacto.html"
obj = requests.get(URL, timeout=60, verify=False, allow_redirects=True)
print(obj.encoding)

Output:

ISO-8859-1

whereas the actual encoding set in the HTML is UTF-8: content="text/html; charset=UTF-8"

My questions are:

Why does requests report a different encoding than the one declared in the HTML page?

I am trying to convert the content to UTF-8 using objReq.content.decode(encodes).encode("utf-8"). Since the content is already UTF-8, decoding it as ISO-8859-1 and re-encoding it as UTF-8 changes the values; for example, á becomes Ã.

Is there any way to convert all types of encodings to UTF-8?

Solution

Requests sets the response.encoding attribute to ISO-8859-1 when you have a text/* response and no charset has been specified in the Content-Type response header.
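The rule can be approximated in a few lines. This is only a sketch of the behaviour described here, not Requests' own code; the function name header_encoding is hypothetical, and it uses the standard library's email.message.Message to parse the header value.

```python
from email.message import Message


def header_encoding(content_type):
    """Approximate the header-only rule (hypothetical helper, not part of Requests).

    Returns the explicit charset if one is declared, ISO-8859-1 for a
    text/* type without a charset, and None otherwise.
    """
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased charset parameter, or None
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"  # the RFC 2616 default for text/* types
    return None


print(header_encoding("text/html; charset=UTF-8"))
print(header_encoding("text/html"))
```

A page served as plain text/html therefore ends up labelled ISO-8859-1, regardless of what the document itself declares.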

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.
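To see why the ISO-8859-1 fallback garbles a UTF-8 page, here is a minimal sketch that works on plain bytes instead of a live response. The sample text is made up for illustration.

```python
# Bytes as a server would send them for a UTF-8 page (sample text is hypothetical).
raw = "contáctenos".encode("utf-8")

# Decoding with the ISO-8859-1 fallback turns each byte of a multi-byte
# character into its own character, which is the mojibake seen in the question.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the document actually uses gives the intended text.
right = raw.decode("utf-8")

# ISO-8859-1 maps every byte to exactly one code point, so the damage from
# the wrong decode can be reversed by encoding back and decoding correctly.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(wrong)
print(right)
print(repaired == right)
```

With a real response, the equivalent fix is the one the documentation names: set the Response.encoding property to the correct codec before reading the text, or decode Response.content yourself.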

You can test for this by looking for a charset parameter in the Content-Type header:

resp = requests.get(....)
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None

Your HTML document specifies the content type in a <meta> header, and it is this header that is authoritative:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

HTML 5 also defines a <meta charset="..." /> tag; see the question "<meta charset="utf-8"> vs <meta http-equiv="Content-Type">".

You should not recode HTML pages to UTF-8 if they contain such a header with a different codec. You must at the very least correct that header in that case.
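Both forms of the in-document declaration can be located with the standard library alone. The sketch below is one possible approach, not a full implementation of the HTML encoding-sniffing rules; the names MetaCharsetFinder and meta_charset are hypothetical.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first charset declared in a <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)  # attribute names arrive lower-cased
        if attrs.get("charset"):
            # HTML 5 form: <meta charset="...">
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="...; charset=...">
            for part in (attrs.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().lower()


def meta_charset(html_bytes):
    """Return the charset declared inside the document, or None."""
    finder = MetaCharsetFinder()
    # The declaration itself is ASCII, so a lossy ASCII decode is enough to find it.
    finder.feed(html_bytes.decode("ascii", errors="replace"))
    return finder.charset


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head></html>'
print(meta_charset(page))
```

If the value found this way differs from the codec you are about to write out, the declaration needs to be rewritten as well.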

Using BeautifulSoup:

from bs4 import BeautifulSoup

# pass in an explicit encoding only if the Content-Type header declared one
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
content = resp.content
# naming a parser ('html.parser' is the stdlib one) avoids bs4's "no parser specified" warning
soup = BeautifulSoup(content, 'html.parser', from_encoding=encoding)
if soup.original_encoding != 'utf-8':
    meta = soup.select_one('meta[charset], meta[http-equiv="Content-Type"]')
    if meta:
        # replace the meta charset info before re-encoding
        if 'charset' in meta.attrs:
            meta['charset'] = 'utf-8'
        else:
            meta['content'] = 'text/html; charset=utf-8'
    # re-encode to UTF-8; without the encoding argument prettify() returns str, not bytes
    content = soup.prettify(encoding='utf-8')

Similarly, other document standards may also specify their own encodings; XML, for example, defaults to UTF-8 (or UTF-16 when a byte order mark is present) unless a <?xml encoding="..." ... ?> XML declaration says otherwise, and that declaration is again part of the document.
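As a rough illustration of the XML case, the declared encoding can be read from the first bytes of the document. This sketch assumes an ASCII-compatible encoding and ignores byte order marks and UTF-16 input; the function name xml_declared_encoding is hypothetical.

```python
import re

# Matches an XML declaration at the very start of the document and
# captures the value of its encoding pseudo-attribute.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(xml_bytes, default="utf-8"):
    """Return the encoding named in the XML declaration, else the default."""
    match = XML_DECL.match(xml_bytes)
    if match:
        return match.group(1).decode("ascii").lower()
    return default


print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(xml_declared_encoding(b'<?xml version="1.0"?><a/>'))
```

In practice an XML parser performs this detection itself when given bytes, so a check like this is mainly useful for deciding whether a declaration has to be rewritten before re-encoding.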
