BeautifulSoup: findAll doesn't find the tags
Question
I'm sorry about the many questions I post, but I have no idea what to do about this bug: when testing this page with a few simple lines:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://www.wired.com/2016/08/cape-watch-99/").content)
ab = soup.find("article", {"itemprop": "articleBody"})
p = ab.findAll("p")
print(len(p))  # gives 1
There are many p tags, but I get only the first. I tried to copy-paste the whole <article itemprop="articleBody"> html text into a string and passed it to a new BeautifulSoup object. Searching that object for p gave all the desired tags (14).
Why doesn't the usual approach work? Are the p tags loaded dynamically here (though the html code looks pretty normal)?
Accepted answer
The problem is the parser:
In [21]: req = requests.get("http://www.wired.com/2016/08/cape-watch-99/")
In [22]: soup = BeautifulSoup(req.content, "lxml")
In [23]: len(soup.select("article[itemprop=articleBody] p"))
Out[23]: 26
In [24]: soup = BeautifulSoup(req.content, "html.parser")
In [25]: len(soup.select("article[itemprop=articleBody] p"))
Out[25]: 1
In [26]: soup = BeautifulSoup(req.content, "html5lib")
In [27]: len(soup.select("article[itemprop=articleBody] p"))
Out[27]: 26
You can see that html5lib and lxml pick up all the p tags, but the standard html.parser does not handle the broken html as well. Running the article html through the validator.w3 service produces a lot of output flagging the broken markup.
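As a quick sanity check that the CSS selector used above matches the find/findAll approach from the question, here is a minimal, self-contained sketch. The markup is a made-up stand-in for the article body, not the actual Wired page, so the counts are illustrative only:

```python
from bs4 import BeautifulSoup

# Minimal, made-up markup standing in for the article body
html = (
    '<article itemprop="articleBody">'
    "<p>one</p><p>two</p><p>three</p>"
    "</article>"
)

# Well-formed markup, so even html.parser handles it fine
soup = BeautifulSoup(html, "html.parser")

# CSS-selector form used in the answer
via_select = soup.select("article[itemprop=articleBody] p")

# find/findAll form used in the question
ab = soup.find("article", {"itemprop": "articleBody"})
via_find = ab.findAll("p")

print(len(via_select), len(via_find))  # both count the same three tags
```

On well-formed input the two forms agree; it is only on broken markup like the article page that the choice of parser changes what the tree looks like, and therefore what either form can find.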