How to Parse HTML with Non-ASCII Characters using BeautifulSoup?

Question

I keep getting the following error when trying to parse some html using BeautifulSoup:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

I've tried decoding the html using the solution to the questions below, but keep getting the same error. I've tried all the solutions to the questions below but none of them work (posting so that I don't get duplicate answers and in case they help anyone to find a solution by viewing related approaches to the problem).

Anybody know where I'm going wrong here? Is this a bug in BeautifulSoup and should I install an earlier version?

Code and traceback below:

from BeautifulSoup import BeautifulSoup as bs
soup = bs(html)

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

error message per comment below:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

Thanks for your help!

"ascii" codec error in beautifulsoup

UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

How do I convert a file's format from Unicode to ASCII using Python?

python UnicodeEncodeError > How can I simply remove troubling unicode characters?

Recommended answer

You say in a comment: """I just looked up the content-type of the html I'm trying to parse to see if it was something I hadn't tried (earlier I just assumed it was UTF-8) but sure enough it was UTF-8 so another dead end."""

Sigh. This is exactly why I have been trying to get you to divulge the HTML that you are trying to parse. The error message indicates that the (first) problem byte is \xae which is definitely NOT a valid lead byte in a UTF-8 sequence.

Either divulge the link to your HTML, or do some basic debugging:

Does uc = html.decode('utf8') work or fail? If fail, with what error message?
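The check asked for above could look something like the following. This is a minimal sketch written in Python 3 syntax (the question uses Python 2.5, where `html` would be a byte `str`); the helper name `try_decode` and the sample byte strings are illustrative assumptions, not taken from the asker's page.

```python
def try_decode(raw, encoding="utf-8"):
    """Return (text, None) on success, or (None, error message) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


# A lone 0xae byte cannot start a UTF-8 sequence, so decoding fails.
lone_text, lone_error = try_decode(b"\xae")

# The registered sign (U+00AE) properly encoded as UTF-8 is two bytes: 0xc2 0xae.
utf8_text, utf8_error = try_decode(b"\xc2\xae")

# The same single byte is valid in Latin-1, which hints at a mislabelled encoding.
latin1_text, latin1_error = try_decode(b"\xae", "latin-1")

print("lone 0xae as UTF-8:", lone_error)
print("0xc2 0xae as UTF-8:", repr(utf8_text))
print("lone 0xae as Latin-1:", repr(latin1_text))
```

If the real document decodes cleanly as UTF-8, the stray `\xae` is probably not coming from the raw bytes at all, which is worth knowing before blaming the parser.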

You also said: """I'm starting to think this is a bug in BS, which they allude to in the docs, and can be seen here: crummy.com/software/BeautifulSoup/CHANGELOG.html."""

I can't imagine which of the vague entries in the changelog you are referring to. Consider debugging your problem before you rush to update.

Update: This looks like an obscure bug in sgmllib.py. In line 394, change 255 to 127 and it appears to work. Corner case: an HTML char ref (&#174;) in an attribute value AND with 128 <= ordinal < 255.
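One plausible reading of that corner case is sketched below. It is not the sgmllib source: `convert_charref` here is a hypothetical stand-in whose `limit` parameter plays the role of the 255/127 constant, and the code is Python 3, where the ASCII decode has to be done explicitly (Python 2 would attempt it implicitly when a byte string is mixed with unicode text). The assumption is that a numeric reference at or below the limit is turned into a single raw byte, and anything above it is left untouched.

```python
def convert_charref(name, limit):
    """Hypothetical stand-in: turn a numeric char ref into one raw byte.

    Returns None when the reference is not numeric or is above `limit`,
    meaning "leave the reference as it is".
    """
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= limit:
        return None
    return bytes([n])


def ascii_decode_error(raw):
    """Return the error message from decoding `raw` as ASCII, or None if it decodes."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# With the limit at 255, &#174; becomes the raw byte 0xae ...
converted = convert_charref("174", limit=255)
# ... which cannot be decoded as ASCII.
message = ascii_decode_error(converted)

# With the limit at 127, &#174; is left alone and no raw byte is produced.
untouched = convert_charref("174", limit=127)

print(repr(converted), message, untouched)
```

Under these assumptions, lowering the limit keeps any non-ASCII byte from being produced in the first place, which would be consistent with the one-constant change appearing to work.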

Further comments: Rather than hack your copy of sgmllib.py, grab a copy of the latest sgmllib.py from the 2.7 branch; BS 3.0.4 ran OK for me on Python 2.7.1. Even better, upgrade your Python to 2.7.
