Requests module encoding differs from the encoding declared in the HTML


Problem description

The encoding reported by the Requests module differs from the encoding actually set in the HTML page.

Code:

import requests
URL = "http://www.reynamining.com/nuevositio/contacto.html"
obj = requests.get(URL, timeout=60, verify=False, allow_redirects=True)
print(obj.encoding)

Output:

ISO-8859-1

Whereas the actual encoding set in the HTML is UTF-8: content="text/html; charset=UTF-8"

My questions are:

  1. Why does obj.encoding show a different encoding than the one declared in the HTML page?

I am trying to convert the content to UTF-8 with objReq.content.decode(encodes).encode("utf-8"). Because the content is already UTF-8, decoding it as ISO-8859-1 and then encoding it as UTF-8 changes the values; for example, á changes to Ã.

Is there any way to convert all types of encodings to UTF-8?
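
The corruption described above can be reproduced without any network access. The following is a minimal sketch using only built-in string methods; the variable names are illustrative and not part of the original question.

```python
# Reproduce the mojibake: UTF-8 bytes wrongly decoded as ISO-8859-1.
original = "á"
utf8_bytes = original.encode("utf-8")            # the bytes the server actually sent

# Decoding with the wrong codec turns each byte into its own character.
garbled = utf8_bytes.decode("iso-8859-1")

# Re-encoding the garbled text as UTF-8 bakes the damage into the bytes.
double_encoded = garbled.encode("utf-8")

# The damage is reversible as long as nothing was lost along the way.
recovered = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")

print(repr(garbled), double_encoded, repr(recovered))
```

The second character of the garbled string is an inverted exclamation mark, which is easy to lose when the text is copied around.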

Solution

Requests sets the response.encoding attribute to ISO-8859-1 when you have a text/* response and no charset has been specified in the Content-Type response header.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this [guess the encoding] is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.

You can test for this by looking for a charset parameter in the Content-Type header:

resp = requests.get(....)
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
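
The substring test above also matches a header that merely contains the word "charset" somewhere. As a slightly stricter sketch, the parameter can be parsed with the standard library's email.message.Message; the helper name declared_charset is hypothetical and not part of Requests.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")   # None when the parameter is absent
    return charset.lower() if isinstance(charset, str) else None


print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
```

With a real response, the value to pass in would be resp.headers.get('content-type', '').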

Your HTML document specifies the content type in a <meta> header, and it is this header that is authoritative:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

HTML 5 also defines a <meta charset="..." /> tag; see the related question "<meta charset="utf-8"> vs <meta http-equiv="Content-Type">".
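
To see which encoding a page declares in either form of <meta> tag, the raw bytes can be scanned before any real decoding happens. This is a rough sketch built on the standard library's html.parser; the names MetaCharsetFinder and sniff_meta_charset and the 1024-byte limit are assumptions made for illustration, and this is not the full detection algorithm from the HTML specification.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared by a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        attrs = {name: (value or "") for name, value in attrs}
        if "charset" in attrs:
            # HTML 5 form: <meta charset="...">
            self.charset = attrs["charset"].strip().lower() or None
        elif attrs.get("http-equiv", "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="...; charset=...">
            msg = Message()
            msg["Content-Type"] = attrs.get("content", "")
            value = msg.get_param("charset")
            if isinstance(value, str):
                self.charset = value.lower()


def sniff_meta_charset(raw, limit=1024):
    """Scan the first `limit` bytes of an HTML document for a declared charset."""
    finder = MetaCharsetFinder()
    # Tag and attribute names are ASCII, so a lossy ASCII decode is enough here.
    finder.feed(raw[:limit].decode("ascii", errors="replace"))
    return finder.charset


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head>'
print(sniff_meta_charset(page))
```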

You should not recode HTML pages to UTF-8 if they contain such a header with a different codec. You must at the very least correct that header in that case.

Using BeautifulSoup:

from bs4 import BeautifulSoup

# pass in an explicit encoding only if the server declared one in the header
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
content = resp.content
# 'html.parser' is an assumption here; the original did not name a parser
soup = BeautifulSoup(content, 'html.parser', from_encoding=encoding)
if (soup.original_encoding or '').lower() != 'utf-8':
    meta = soup.select_one('meta[charset], meta[http-equiv="Content-Type"]')
    if meta:
        # replace the meta charset info before re-encoding
        if 'charset' in meta.attrs:
            meta['charset'] = 'utf-8'
        else:
            meta['content'] = 'text/html; charset=utf-8'
    # re-encode to UTF-8; with an explicit encoding, prettify() returns bytes
    content = soup.prettify(encoding='utf-8')
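
For content that is not HTML, or when BeautifulSoup is not available, the byte-level conversion can be sketched with the standard library alone. The helper to_utf8 and its fallback order are assumptions chosen for illustration: it tries the declared codec first, then UTF-8, and only then falls back to ISO-8859-1, which accepts any byte sequence.

```python
def to_utf8(raw, declared=None):
    """Return `raw` re-encoded as UTF-8 bytes.

    `declared` should come from an explicit charset (HTTP header or <meta> tag),
    not from the ISO-8859-1 default that Requests fills in.
    """
    for codec in (declared, "utf-8"):
        if not codec:
            continue
        try:
            return raw.decode(codec).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown codec: try the next candidate.
            continue
    # ISO-8859-1 maps every byte to a character, so this never fails,
    # but the result is only correct if the data really was ISO-8859-1.
    return raw.decode("iso-8859-1").encode("utf-8")


print(to_utf8("á".encode("utf-8")))
print(to_utf8("á".encode("iso-8859-1")))
```

Trying UTF-8 before ISO-8859-1 avoids the double encoding from the question, because bytes that are already valid UTF-8 pass through unchanged.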

Similarly, other document standards may also specify specific encodings; XML for example is always UTF-8 unless specified by a <?xml encoding="..." ... ?> XML declaration, again part of the document.
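
For XML, the declared encoding can be read from the declaration in the same spirit. This is a simplified sketch: the helper xml_declared_encoding is hypothetical, and it deliberately ignores byte order marks and UTF-16 documents, which a real XML parser handles.

```python
import re

# Match an XML declaration at the very start of the document and capture
# the value of its encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(raw, default="utf-8"):
    """Return the encoding named in the XML declaration, or `default`."""
    match = _XML_DECL.match(raw[:200])
    return match.group(1).decode("ascii").lower() if match else default


print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(xml_declared_encoding(b'<?xml version="1.0"?><a/>'))
```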
