Python correct encoding of Website (Beautiful Soup)


Question

I am trying to load an HTML page and output its text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding.

Source:

# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)

encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text =  str(soup.findAll(text=True))
print text.decode("utf-8")

Excerpt of the output:

...Odenw\xc3\xa4lderisch...

This should be Odenwälderisch

Recommended answer

You are making two mistakes: you are mishandling the encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.

First of all, don't use response.text! BeautifulSoup is not at fault here; you are re-encoding Mojibake. When the server doesn't explicitly specify an encoding, the requests library defaults to Latin-1 for text/* content types, because the HTTP standard states that this is the default.
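The effect can be reproduced without any network access. The following is a minimal sketch using only built-in string methods; the sample word is taken from the expected output in the question, and the variable names are illustrative assumptions. It decodes UTF-8 bytes as Latin-1, the way response.text would when that default applies, and then re-encodes the result.

```python
# Sample word from the question ("\u00e4" is the letter a with umlaut).
original = "Odenw\u00e4lderisch"

# What the server sends: UTF-8 bytes, where the umlaut becomes 0xC3 0xA4.
raw = original.encode("utf-8")

# Decoding with the wrong codec (Latin-1) turns each byte into its own character.
mojibake = raw.decode("latin-1")

# Re-encoding the Mojibake as UTF-8 does not undo the damage;
# the two wrong characters are now encoded as four bytes.
re_encoded = mojibake.encode("utf-8")

# Decoding the untouched raw bytes with the right codec recovers the text.
recovered = raw.decode("utf-8")

print(ascii(mojibake))
print(re_encoded == raw)
print(recovered == original)
```

This is why working from the raw bytes matters: once the bytes have been decoded with the wrong codec, encoding them again only preserves the mistake.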

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.

Pass in the response.content raw data instead:

soup = BeautifulSoup(r.content)

I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.

BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or from statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it into BeautifulSoup from the response, but first test whether requests used a default:

encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
parser = 'html.parser'  # or lxml or html5lib
soup = BeautifulSoup(r.content, parser, from_encoding=encoding)
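The substring test above is usually enough. If a stricter check is wanted, one possible alternative is to parse the header value with the standard library. This is only a sketch: declared_charset is a hypothetical helper name, not part of requests or BeautifulSoup, and it takes the Content-Type header value as a plain string.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type value, or None if absent.

    Hypothetical helper for illustration; it relies only on the
    standard library's header parameter parsing.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset()


print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
```

The return value could then be passed as from_encoding, with None leaving the detection to BeautifulSoup.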

Last but not least, with BeautifulSoup 4, you can extract all text from a page using soup.get_text():

text = soup.get_text()
print(text)

In your code, you are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
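The repr() behaviour can be seen in isolation. The sketch below runs on Python 3, where it uses bytes objects as a stand-in for Python 2 byte strings (an assumption made so that the example is runnable today; Python 3 text strings would not be escaped this way). The variable names are illustrative.

```python
# UTF-8 encoded bytes, comparable to a Python 2 str holding the same word.
word = "Odenw\u00e4lderisch".encode("utf-8")
pieces = [word, b" and more text"]

# str() on a list calls repr() on every element, so the non-ASCII bytes
# show up as literal backslash escape sequences in the resulting string.
as_debug_string = str(pieces)

# Joining the elements first and decoding once keeps the text intact.
as_text = b"".join(pieces).decode("utf-8")

print(as_debug_string)
print(ascii(as_text))
```

The debugging string contains the characters backslash, x, c, 3 and so on, which no later decode step will turn back into the original letter.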
