How to find out the correct encoding when using beautifulsoup?


Question

In Python 3 with beautifulsoup4, I want to get information from a website after making a request. I did so:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://sisgvarmazenamento.blob.core.windows.net/prd/PublicacaoPortal/Arquivos/201901.htm')

soup = BeautifulSoup(req.text, 'lxml')

soup.find("h1").text
'\r\n                        CÃ\x82MARA MUNICIPAL DE SÃ\x83O PAULO'

I do not know what the encoding is, but it is a site in Brazilian Portuguese, so it should be UTF-8 or Latin-1.

Is there a way to find out which encoding is correct? And will BeautifulSoup then read this encoding correctly?

Answer

Requests determines the encoding like this:

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you access the Response.text attribute. Requests will first check for an encoding in the HTTP header, and if none is present, will use chardet to attempt to guess the encoding.

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Inspecting the response headers shows that indeed "no explicit charset is present in the HTTP headers and the Content-Type header contains text":

>>> req.headers['content-type']
'text/html'

So Requests faithfully follows the standard and decodes as ISO-8859-1 (Latin-1).
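The garbled output above can be reproduced offline: encode the correct string as UTF-8, then decode those bytes as ISO-8859-1. The string literal is taken from the page heading; the round-trip itself is just an illustration of the mechanism.

```python
s = 'CÂMARA MUNICIPAL DE SÃO PAULO'

# The page's bytes are UTF-8; e.g. 'Â' (U+00C2) becomes the two bytes 0xC3 0x82.
raw = s.encode('utf-8')

# ISO-8859-1 maps each byte to exactly one character, so every two-byte
# UTF-8 sequence turns into two Latin-1 characters: the mojibake seen above.
mojibake = raw.decode('iso-8859-1')
# mojibake == 'CÃ\x82MARA MUNICIPAL DE SÃ\x83O PAULO'

# Decoding with the real encoding restores the text.
assert raw.decode('utf-8') == s
```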

In the response content, a charset is specified:

<META http-equiv="Content-Type" content="text/html; charset=utf-16">

However, this is wrong: decoding as UTF-16 produces mojibake.

chardet correctly identifies the encoding as UTF-8.
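chardet relies on statistical models of byte frequencies, but for just the two candidates in play here (UTF-8 vs. Latin-1), a crude trial-decode heuristic also works, because UTF-8 is strict while ISO-8859-1 accepts any byte. The helper below is a hypothetical sketch, not part of chardet or Requests:

```python
def guess_encoding(raw: bytes, candidates=('utf-8', 'iso-8859-1')):
    """Return the first candidate encoding that decodes `raw` without error.

    UTF-8 must come first: it rejects most non-UTF-8 byte sequences,
    while ISO-8859-1 maps every possible byte and therefore never fails.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding('SÃO PAULO'.encode('utf-8')))       # utf-8
print(guess_encoding('SÃO PAULO'.encode('iso-8859-1')))  # iso-8859-1
```

This only discriminates between the listed candidates; unlike chardet it cannot rank encodings that both decode without error.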

To summarise:

  • There is no universally reliable way to determine a text's encoding with complete accuracy.
  • In this case, the correct encoding is UTF-8.

Working code:

>>> req.encoding = 'UTF-8'
>>> soup = BeautifulSoup(req.text, 'lxml')
>>> soup.find('h1').text
'\r\n                        CÂMARA MUNICIPAL DE SÃO PAULO'
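Alternatively, the guessing on the Requests side can be bypassed entirely by handing BeautifulSoup the raw bytes and naming the encoding yourself via from_encoding. A minimal offline sketch with a made-up HTML snippet (the real page's bytes would come from Response.content):

```python
from bs4 import BeautifulSoup

# Raw bytes, as requests exposes them via Response.content;
# b'\xc3\x82' is the UTF-8 encoding of 'Â'.
raw = b'<html><body><h1>C\xc3\x82MARA MUNICIPAL</h1></body></html>'

# from_encoding tells BeautifulSoup how to decode the bytes,
# skipping its own encoding detection.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')
print(soup.find('h1').text)  # CÂMARA MUNICIPAL
```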
