BeautifulSoup doesn't find correctly parsed elements
Question
I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents. I stumbled upon a very bizarre thing.
The HTML comes from this page: http://www.wvdnr.gov/
It contains multiple errors, like multiple <html></html> tags, a <title> outside the <head>, etc.
However, html5lib usually works well even in these cases. In fact, when I do:
soup = BeautifulSoup(document, "html5lib")
and I pretty-print soup, I see the following output: http://pastebin.com/8BKapx88
which contains a lot of <a> tags.
However, when I do soup.find_all("a") I get an empty list. With lxml I get the same.
So: has anybody stumbled upon this problem before? What is going on? How do I get the links that html5lib found but find_all isn't returning?
Answer
When it comes to parsing not-well-formed, tricky HTML, the choice of parser is very important:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will give different results.
html.parser works for me:
from bs4 import BeautifulSoup
import requests

document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
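Since you can't always predict which parser a given messy page needs, one pragmatic workaround is to try several and keep the first tree that actually contains links. This is a sketch of my own, not part of the answer above; the parser order and the "has links" heuristic are arbitrary choices:

```python
from bs4 import BeautifulSoup

def parse_with_fallback(document, parsers=("html.parser", "lxml", "html5lib")):
    """Return the first soup whose tree contains <a> tags, else the last one built."""
    soup = None
    for features in parsers:
        try:
            candidate = BeautifulSoup(document, features)
        except Exception:  # parser library not installed
            continue
        soup = candidate
        if soup.find_all("a"):
            return soup
    return soup

soup = parse_with_fallback("<p><a href='/x'>a link</a></p>")
print(len(soup.find_all("a")))  # prints 1: html.parser already finds the link here
```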