BeautifulSoup: findAll doesn't find the tags


Question

I'm sorry about the many questions I post, but I have no idea what to do about this bug: when testing this page with a simple search for p tags

ab = soup.find("article", {"itemprop": "articleBody"})
p = ab.findAll("p")
print(len(p))  # gives 1

There are many p tags, but I get only the first. I tried copy-pasting the whole <article itemprop="articleBody"> html text into a string and passing it to a new BeautifulSoup object. Searching that object for p gave all the desired tags (14).

Why doesn't the usual approach work? Are the p tags loaded dynamically here (though the html code looks pretty normal)?

Answer

The problem is the parser:

In [21]: req = requests.get("http://www.wired.com/2016/08/cape-watch-99/")

In [22]: soup = BeautifulSoup(req.content, "lxml")

In [23]: len(soup.select("article[itemprop=articleBody] p"))
Out[23]: 26

In [24]: soup = BeautifulSoup(req.content, "html.parser")

In [25]: len(soup.select("article[itemprop=articleBody] p"))
Out[25]: 1

In [26]: soup = BeautifulSoup(req.content, "html5lib")

In [27]: len(soup.select("article[itemprop=articleBody] p"))
Out[27]: 26

You can see that html5lib and lxml get all the p tags, but the standard html.parser does not handle the broken html as well. Running the article html through validator.w3 you get a lot of output confirming that the page's markup is invalid.
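The fix, then, is to pass a more lenient parser to BeautifulSoup explicitly. A minimal sketch, using a made-up, well-formed stand-in for the article markup (on the real broken page you would swap in "lxml" or "html5lib", each installed separately with pip):

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the article markup.
html = """
<article itemprop="articleBody">
  <p>first paragraph</p>
  <p>second paragraph</p>
</article>
"""

# Pass the parser name explicitly as the second argument; use
# "lxml" or "html5lib" for broken real-world pages
# (pip install lxml html5lib).
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.select("article[itemprop=articleBody] p")
print(len(paragraphs))  # 2 on this well-formed snippet
```

On clean markup like this, all three parsers agree; the divergence in the session above only appears because the Wired page's html is broken, and html.parser recovers from the errors less gracefully than lxml or html5lib.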

