How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?
Question
I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.
However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.
Sample program:
import urllib2
from BeautifulSoup import BeautifulSoup
# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
# Parse with BeautifulSoup
soup = BeautifulSoup(response)
# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])
Running this gives the result:
# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'
But I would expect the Python Unicode string to render the ö in the word können as \xf6 (U+00F6, see http://www.fileformat.info/info/unicode/char/00f6/index.htm):
# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'
I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and trying to read() and decode() the response object, but it either makes no difference or throws an error.
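For reference, the two attempts looked roughly like this (a sketch assuming BeautifulSoup 3, whose constructor accepts a fromEncoding argument; each attempt needs a fresh response object, since read() consumes the stream):
# Attempt 1: hint the encoding to BeautifulSoup
soup = BeautifulSoup(response.read(), fromEncoding='utf-8')
# Attempt 2 (on a fresh response): decode the raw bytes before parsing
html = response.read().decode('utf-8')  # this is where a decode error can be raised
soup = BeautifulSoup(html)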
With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):
20 74 69 74 6c 65 3d 22 48 69 65 72 20 6b c3 b6 | title="Hier k..|
6e 6e 65 6e 20 53 69 65 20 73 69 63 68 20 6b 6f |nnen Sie sich ko|
73 74 65 6e 6c 6f 73 20 72 65 67 69 73 74 72 69 |stenlos registri|
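The same check can be done from Python (a hypothetical debugging snippet, reusing urllib2 as in the sample program above):
raw = urllib2.urlopen('http://www.voxnow.de/').read()
# 0xc3 0xb6 is the two-byte UTF-8 encoding of o-umlaut (U+00F6)
print '\xc3\xb6' in raw  # prints True if the page contains a UTF-8 encoded ö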
I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?
Answer
As justhalf points out above, my question here is essentially a duplicate of this question.
The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters.
This apparently confuses BeautifulSoup about which encoding is in use. When trying to decode the content as UTF-8 first, before passing it to BeautifulSoup like this:
soup = BeautifulSoup(response.read().decode('utf-8'))
I would get the error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813:
invalid continuation byte
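To locate the rogue bytes, one can catch the exception and inspect the raw data around the reported offset (a debugging sketch, not part of the original program; UnicodeDecodeError exposes the failing position via its start attribute):
data = response.read()
try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    # e.start is the byte offset of the first invalid byte
    print repr(data[e.start - 10 : e.start + 10])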
Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.
As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup:
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
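If silently dropping bytes is undesirable, the 'replace' error handler is an alternative: it substitutes U+FFFD (the Unicode replacement character) for each invalid sequence, so the damage stays visible in the parsed text:
soup = BeautifulSoup(response.read().decode('utf-8', 'replace'))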