How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?


Problem description


I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.

However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.

Sample program:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])

Running this gives the result:

# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'

But I would expect a Python Unicode string to render ö in the word können as \xf6:

# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'

I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and calling read() and decode() on the response object, but it either makes no difference or throws an error.

With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):

      20 74 69 74 6c 65 3d 22  48 69 65 72 20 6b c3 b6  | title="Hier k..|
      6e 6e 65 6e 20 53 69 65  20 73 69 63 68 20 6b 6f  |nnen Sie sich ko|
      73 74 65 6e 6c 6f 73 20  72 65 67 69 73 74 72 69  |stenlos registri|
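The byte sequence the hexdump shows can be sanity-checked without leaving Python (a minimal sketch in Python 3 syntax; under Python 2 the literals would be u'\xf6' and '\xc3\xb6'):

```python
# 'ö' is U+00F6; its UTF-8 encoding is the two bytes 0xc3 0xb6,
# which is exactly what appears in the hexdump inside "k c3 b6".
o_umlaut = '\xf6'  # 'ö'
assert o_umlaut.encode('utf-8') == b'\xc3\xb6'

# Decoding those two bytes gives the character back:
assert b'\xc3\xb6'.decode('utf-8') == o_umlaut
```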

I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?

Solution

As justhalf points out above, my question here is essentially a duplicate of this question.

The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters.

This apparently confused BeautifulSoup about which encoding was in use, and when I tried to decode the content as UTF-8 myself before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8'))

I would get the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: 
                    invalid continuation byte

Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.
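The failure can be reproduced in isolation (a short sketch in Python 3 syntax; the sample string is my own, not taken from the page):

```python
# Valid UTF-8 for 'Ü' (U+00DC) is the byte pair 0xc3 0x9c:
assert b'\xc3\x9c'.decode('utf-8') == '\xdc'

# The rogue pair 0xe3 0x9c is broken: 0xe3 opens a three-byte
# sequence, so the byte after 0x9c must be another continuation
# byte (0x80-0xbf). An ordinary ASCII letter there triggers the
# same error that strict decoding raised above:
try:
    b'Hier k\xe3\x9cnnen'.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)  # -> invalid continuation byte
```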

As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
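For completeness, here is how the two built-in error handlers treat a string containing the rogue bytes (again Python 3 syntax, with a made-up sample string; note that 'ignore' silently loses the mangled character):

```python
raw = b'Hier k\xe3\x9cnnen Sie sich kostenlos registrieren'

# 'ignore' drops the undecodable bytes, which is what the fix
# above does -- the corrupt character simply disappears:
assert raw.decode('utf-8', 'ignore') == 'Hier knnen Sie sich kostenlos registrieren'

# 'replace' substitutes U+FFFD instead, which keeps the corruption
# visible if you would rather spot it than lose it:
print(raw.decode('utf-8', 'replace'))
```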
