BeautifulSoup doesn't find correctly parsed elements


Problem description

I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing.

The HTML comes from this page: http://www.wvdnr.gov/

It contains multiple errors, like multiple <html></html>, <title> outside the <head>, etc...

However, html5lib usually works well even in these cases. In fact, when I do:

soup = BeautifulSoup(document, "html5lib")

and I pretty-print soup, I see the following output: http://pastebin.com/8BKapx88

It contains a lot of <a> tags.

However, when I do soup.find_all("a") I get an empty list. With lxml I get the same.

So: has anybody stumbled on this problem before? What is going on? How do I get the links that html5lib found but isn't returning with find_all?

Recommended answer

When it comes to parsing a not well-formed and tricky HTML, the parser choice is very important:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.

html.parser worked for me:

from bs4 import BeautifulSoup
import requests

document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))

Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
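The reason html.parser copes here is that it streams tags through event callbacks rather than trying to repair the document into a strict tree, so invalid nesting doesn't make it drop content. As a rough illustration, a minimal sketch using only Python's standard-library html.parser directly (the broken snippet and the LinkCollector class are made up for this example, not taken from the page above):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values of all <a> tags, ignoring document structure."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Fired for every start tag, even inside invalidly nested markup.
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

# Deliberately malformed: multiple <html> blocks, <title> outside <head>.
broken = """
<html><head></head></html>
<title>stray title</title>
<html><body><a href="/one">One</a> <a href="/two">Two</a></body></html>
"""

collector = LinkCollector()
collector.feed(broken)
print(collector.links)  # ['/one', '/two']
```

The lenient, event-driven behavior shown here is what BeautifulSoup builds on when you pass "html.parser", whereas html5lib and lxml first normalize the markup and can discard content while doing so.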
