How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
Question
I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.
However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.
Sample program:
import urllib2
from BeautifulSoup import BeautifulSoup
# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
# Parse with BeautifulSoup
soup = BeautifulSoup(response)
# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])
Running this gives the result:
# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'
But I would expect the Python Unicode string to render the ö in the word können as \xf6 (U+00F6, see http://www.fileformat.info/info/unicode/char/00f6/index.htm):
# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'
I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and trying to read() and decode() the response object, but it either makes no difference or throws an error.
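For reference, the two attempts looked roughly like this (a sketch assuming BeautifulSoup 3, whose constructor accepts a fromEncoding argument; each attempt needs a fresh response object, since read() consumes the stream):
# Attempt 1: hint the encoding to BeautifulSoup
soup = BeautifulSoup(response.read(), fromEncoding='utf-8')
# Attempt 2 (on a fresh response): decode the raw bytes before parsing
html = response.read().decode('utf-8')  # this is where a decode error can be raised
soup = BeautifulSoup(html)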
With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):
20 74 69 74 6c 65 3d 22 48 69 65 72 20 6b c3 b6 | title="Hier k..|
6e 6e 65 6e 20 53 69 65 20 73 69 63 68 20 6b 6f |nnen Sie sich ko|
73 74 65 6e 6c 6f 73 20 72 65 67 69 73 74 72 69 |stenlos registri|
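The same check can be done from Python (a hypothetical debugging snippet, reusing urllib2 as in the sample program above):
raw = urllib2.urlopen('http://www.voxnow.de/').read()
# 0xc3 0xb6 is the two-byte UTF-8 encoding of o-umlaut (U+00F6)
print '\xc3\xb6' in raw  # prints True if the page contains a UTF-8 encoded ö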
I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?
Answer
As justhalf points out above, my question here is essentially a duplicate of this question.
The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters.
This apparently confuses BeautifulSoup about which encoding is in use. When trying to decode the content as UTF-8 first, before passing it to BeautifulSoup like this:
soup = BeautifulSoup(response.read().decode('utf-8'))
I would get the error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813:
invalid continuation byte
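To locate the rogue bytes, one can catch the exception and inspect the raw data around the reported offset (a debugging sketch, not part of the original program; UnicodeDecodeError exposes the failing position via its start attribute):
data = response.read()
try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    # e.start is the byte offset of the first invalid byte
    print repr(data[e.start - 10 : e.start + 10])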
Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.
As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup:
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
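If silently dropping bytes is undesirable, the 'replace' error handler is an alternative: it substitutes U+FFFD (the Unicode replacement character) for each invalid sequence, so the damage stays visible in the parsed text:
soup = BeautifulSoup(response.read().decode('utf-8', 'replace'))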