Python correct encoding of Website (Beautiful Soup)


Problem description

I am trying to load an HTML page and output the text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding.

Source:

# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)

encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text =  str(soup.findAll(text=True))
print text.decode("utf-8")

Excerpt of the output:

...Odenw\xc3\xa4lderisch...

This should be Odenwälderisch.

Recommended answer

You are making two mistakes; you are mis-handling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.

First of all, don't use response.text! It is not BeautifulSoup at fault here; you are re-encoding a Mojibake. The requests library will default to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.
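To see the mechanism at work, here is a minimal sketch of the double encoding that produces the Mojibake (a hypothetical example assuming a UTF-8 page served without a charset header; Python 2, matching the question):

raw = u"Odenw\xe4lderisch".encode("utf-8")  # the bytes the server actually sends
text = raw.decode("latin-1")                # what r.text holds under the RFC 2616 default
print repr(text.encode("utf-8"))            # 'Odenw\xc3\x83\xc2\xa4lderisch' -- the Mojibake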

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.
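If you know the correct character set up front, one option the quoted documentation offers is to set Response.encoding yourself before reading r.text (a sketch, reusing url from the question and assuming the page is UTF-8):

r = requests.get(url)
r.encoding = "utf-8"  # override the Latin-1 default before touching r.text
text = r.text         # now decoded as UTF-8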

Pass in the response.content raw data instead:

soup = BeautifulSoup(r.content)

I see you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.
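For example (assuming pip is available):

# pip install beautifulsoup4
from bs4 import BeautifulSoup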

BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or statistical analysis of the bytes provided. If the server does provide a character set, you can still pass this into BeautifulSoup from the response, but do test first if requests used a default:

encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(r.content, from_encoding=encoding)

最后但并非最不重要的,与BeautifulSoup 4,可以使用提取网页中的所有文本 soup.get_text()

text = soup.get_text()
print text
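Putting it all together, a corrected version of the script from the question could look like this (a sketch under the assumptions above, using BeautifulSoup 4):

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)

# Only trust r.encoding when the server actually declared a charset
encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(r.content, from_encoding=encoding)

print soup.get_text().encode("utf-8")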

You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
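A quick illustration of that repr() behaviour (Python 2, with a made-up one-element list standing in for the soup.findAll() result):

results = [u'Odenw\xe4lderisch']   # stand-in for soup.findAll(text=True)
print str(results)                 # prints [u'Odenw\xe4lderisch'] -- repr() escapes the umlaut
print results[0].encode("utf-8")   # prints Odenwälderisch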

