What's the most forgiving HTML parser in Python?
Question
I have some arbitrary HTML that I tried to parse with BeautifulSoup, but in most cases (>70%) it chokes. I tried Beautiful Soup 3.0.8 and 3.2.0 (there were some problems from 3.1.0 onward), but the results are almost the same.
Off the top of my head, I can recall several HTML parser options available in Python:
- BeautifulSoup
- lxml
- pyquery
I intend to test all of these, but I wanted to know which one has come out as the most forgiving in your tests, and which will at least attempt to parse bad HTML.
Recommended answer
I ended up using BeautifulSoup 4.0 with html5lib for parsing, which is much more forgiving. With some modifications to my code it's now working considerably well. Thanks, all, for the suggestions.
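For anyone landing here later, this is roughly what that setup looks like. It is a minimal sketch, not my exact code: it assumes the `beautifulsoup4` and `html5lib` packages are installed, and the `parse_forgiving` helper and the sample markup are made up for illustration. The fallback to `html.parser` is my own addition so the snippet still runs without html5lib, although the resulting tree can differ.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_forgiving(markup):
    """Parse possibly malformed HTML, preferring the html5lib tree builder.

    Hypothetical helper for illustration only.
    """
    try:
        # html5lib follows the HTML5 parsing algorithm, so it repairs
        # broken markup much like a browser would.
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # Assumption: html5lib is not installed, so fall back to the
        # parser backed by the standard library (less forgiving).
        return BeautifulSoup(markup, "html.parser")


# Deliberately broken markup: unclosed <p>, <li>, and <b> tags.
broken = "<p>Unclosed paragraph<ul><li>one<li>two</ul><b>bold"

soup = parse_forgiving(broken)
items = [li.get_text() for li in soup.find_all("li")]
bold_text = soup.find("b").get_text()

print(items)
print(bold_text)
```

The second argument to `BeautifulSoup` selects the underlying parser, so swapping in `"lxml"` is a one-word change if you want to compare how each one handles the same input.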