BeautifulSoup doesn't find correctly parsed elements
Question
I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing.
The HTML comes from this page: http://www.wvdnr.gov/
It contains multiple errors, like multiple <html></html> tags, a <title> outside the <head>, etc.
However, html5lib usually works well even in these cases. In fact, when I do:
soup = BeautifulSoup(document, "html5lib")
and I pretty-print soup, I see the following output: http://pastebin.com/8BKapx88
which contains a lot of <a> tags.
However, when I do soup.find_all("a") I get an empty list. With lxml I get the same.
So: has anybody stumbled upon this problem before? What is going on? How do I get the links that html5lib found but find_all isn't returning?
Answer
When it comes to parsing not-well-formed, tricky HTML, the choice of parser is very important:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will give different results.
html.parser works for me:
from bs4 import BeautifulSoup
import requests

document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
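Since you can't always predict which parser a given messy page needs, one pragmatic workaround is to try several and keep the first tree that actually contains links. This is a sketch of my own, not part of the answer above; the parser order and the "has links" heuristic are arbitrary choices:

```python
from bs4 import BeautifulSoup

def parse_with_fallback(document, parsers=("html.parser", "lxml", "html5lib")):
    """Return the first soup whose tree contains <a> tags, else the last one built."""
    soup = None
    for features in parsers:
        try:
            candidate = BeautifulSoup(document, features)
        except Exception:  # parser library not installed
            continue
        soup = candidate
        if soup.find_all("a"):
            return soup
    return soup

soup = parse_with_fallback("<p><a href='/x'>a link</a></p>")
print(len(soup.find_all("a")))  # prints 1: html.parser already finds the link here
```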