How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?


Problem description

I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.

However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.

Sample code:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])

Running this gives the result:

# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'

But I was expecting a Python Unicode string that renders the ö in the word können as \xf6:

# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and trying to read() and decode() the response object, but it either makes no difference, or throws an error.

With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains the bytes 0xc3 0xb6 for the ö character):

      20 74 69 74 6c 65 3d 22  48 69 65 72 20 6b c3 b6  | title="Hier k..|
      6e 6e 65 6e 20 53 69 65  20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
      73 74 65 6e 6c 6f 73 20  72 65 67 69 73 74 72 69  |stenlos registri|
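The hexdump can be cross-checked in Python itself: encoding ö (U+00F6) as UTF-8 yields exactly the two bytes 0xc3 0xb6 seen above (a minimal sketch in Python 3 syntax; the question's code is Python 2):

```python
# "ö" is code point U+00F6; its UTF-8 encoding is the
# two-byte sequence 0xc3 0xb6, matching the hexdump output.
encoded = u'\xf6'.encode('utf-8')
print(repr(encoded))  # b'\xc3\xb6'

# Decoding those two bytes gives back the single code point.
decoded = b'\xc3\xb6'.decode('utf-8')
print(repr(decoded))  # 'ö'
```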

I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?

Recommended answer

As justhalf points out above, my question here is essentially a duplicate of this question.

The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters.

This apparently confuses BeautifulSoup about which encoding is in use. When I tried to decode the content as UTF-8 first before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8'))

I would get the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: 
                    invalid continuation byte

Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.
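The difference between the two byte sequences can be reproduced directly (Python 3 syntax; the trailing ASCII bytes here are illustrative, not taken from the actual page):

```python
# The correct UTF-8 encoding of "Ü" (U+00DC) is 0xc3 0x9c.
print(b'\xc3\x9c'.decode('utf-8'))  # Ü

# 0xe3 instead starts a *three*-byte UTF-8 sequence, so after
# the continuation byte 0x9c the decoder expects one more
# continuation byte; an ASCII byte there triggers exactly the
# error seen above.
try:
    b'\xe3\x9cber'.decode('utf-8')
except UnicodeDecodeError as e:
    print(e.reason)  # invalid continuation byte
```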

As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
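As a minimal sketch of what the 'ignore' error handler does (Python 3 syntax, with a made-up byte string containing one rogue 0xe3 0x9c sequence):

```python
# Mostly valid UTF-8, with one invalid sequence (0xe3 0x9c) in the middle.
data = b'g\xc3\xbcltig \xe3\x9c ung\xc3\xbcltig'

# 'ignore' silently drops the malformed bytes, so BeautifulSoup
# only ever sees valid text...
print(data.decode('utf-8', 'ignore'))   # gültig  ungültig

# ...whereas 'replace' substitutes U+FFFD, which keeps the
# damage visible if you'd rather spot it than hide it.
print(data.decode('utf-8', 'replace'))
```

Note that 'ignore' loses the affected character entirely; for debugging, 'replace' can be the friendlier choice.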
