BeautifulSoup: how does findAll work?


Question

I've noticed some odd behavior in BeautifulSoup's findAll method:

>>> from bs4 import BeautifulSoup
>>> htmls="<html><body><p class=\"pagination-container\">slytherin</p><p class=\"pagination-container and something\">gryffindor</p></body></html>"
>>> soup=BeautifulSoup(htmls, "html.parser")
>>> for i in soup.findAll("p",{"class":"pagination-container"}):
    print(i.text)


slytherin
gryffindor
>>> for i in soup.findAll("p", {"class":"pag"}):
    print(i.text)


>>> for i in soup.findAll("p",{"class":"pagination-container"}):
    print(i.text)


slytherin
gryffindor
>>> for i in soup.findAll("p",{"class":"pagination"}):
    print(i.text)


>>> len(soup.findAll("p",{"class":"pagination-container"}))
2
>>> len(soup.findAll("p",{"class":"pagination-containe"}))
0
>>> len(soup.findAll("p",{"class":"pagination-contai"}))
0
>>> len(soup.findAll("p",{"class":"pagination-container and something"}))
1
>>> len(soup.findAll("p",{"class":"pagination-conta"}))
0

So, when we search for pagination-container, it returns both the first and the second p tag. That made me think it performs a partial match, something like if passed_string in class_attribute_value:. So I shortened the string passed to findAll, but then it never found anything!

How is that possible?

Recommended answer

First of all, class is a special multi-valued, space-delimited attribute, and BeautifulSoup handles it specially.

When you write soup.findAll("p", {"class":"pag"}), BeautifulSoup searches for elements that have the class pag. It splits each element's class value on whitespace and checks whether pag is one of the resulting items. An element with class="test pag" or class="pag" would match; an element with class="pagination-container" would not, because pag is only a substring of one class name rather than a class of its own.

Note that in the case of soup.findAll("p", {"class": "pagination-container and something"}), BeautifulSoup matches the element whose class attribute has exactly that value. No splitting is involved here: it simply finds an element whose complete class value equals the requested string.
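
The two rules above can be sketched in a few lines of plain Python. This is only a simplified model of the behavior described in this answer, not BeautifulSoup's actual implementation; the function name class_matches is made up for illustration.

```python
def class_matches(class_attr: str, wanted: str) -> bool:
    """Simplified model of how a plain-string class filter is matched.

    Illustration only; this is not BeautifulSoup's real code.
    """
    tokens = class_attr.split()        # "a b c" -> ["a", "b", "c"]
    if wanted in tokens:               # rule 1: equals one individual class
        return True
    return wanted == " ".join(tokens)  # rule 2: equals the complete attribute value


second_p = "pagination-container and something"
for query in ("pagination-container", "pag", "and", second_p, "something and"):
    print(f"{query!r}: {class_matches(second_p, query)}")
```

Under this model, "pagination-container" and "and" match because each is one of the split items, the full string matches because it equals the complete value, and "pag" and "something and" match neither rule.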

To get a partial match on one of the classes, you can pass a regular expression or a function as the class filter value:

import re

soup.find_all("p", {"class": re.compile(r"pag")})  # contains pag
soup.find_all("p", {"class": re.compile(r"^pag")})  # starts with pag

soup.find_all("p", {"class": lambda class_: class_ and "pag" in class_})  # contains pag
soup.find_all("p", {"class": lambda class_: class_ and class_.startswith("pag")})  # starts with pag
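
As a quick check, the filters above can be run against the HTML from the question. This is a minimal sketch that assumes bs4 is installed; the third paragraph (class "other") and the texts helper are additions for illustration and are not part of the original question.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><body>"
    '<p class="pagination-container">slytherin</p>'
    '<p class="pagination-container and something">gryffindor</p>'
    '<p class="other">hufflepuff</p>'  # extra element, added for contrast
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Hypothetical helper: collect the text of each matched tag."""
    return [tag.get_text() for tag in tags]


# Plain string: must equal one class or the whole attribute value.
plain = texts(soup.find_all("p", {"class": "pag"}))

# Regular expression: matched with search(), so "^pag" means "starts with pag".
by_regex = texts(soup.find_all("p", {"class": re.compile(r"^pag")}))

# Function: receives a class value and returns True for a match.
by_func = texts(
    soup.find_all("p", {"class": lambda c: c is not None and c.startswith("pag")})
)

print(plain)     # []
print(by_regex)  # ['slytherin', 'gryffindor']
print(by_func)   # ['slytherin', 'gryffindor']
```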


There is much more to say, but you should also know that BeautifulSoup has CSS selector support (limited, but it covers most of the common use cases). You can write things like:

soup.select("p.pagination-container")  # one of the classes is "pagination-container"
soup.select("p[class='pagination-container']")  # match the COMPLETE class attribute value
soup.select("p[class^=pag]")  # COMPLETE class attribute value starts with pag
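
To see how these selectors differ, here is a small sketch run against markup similar to the question's. It assumes a bs4 installation with selector support available; the third paragraph (class "something pagination-container") and the texts helper are additions for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="pagination-container">slytherin</p>'
    '<p class="pagination-container and something">gryffindor</p>'
    '<p class="something pagination-container">ravenclaw</p>'  # added for contrast
)
soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Hypothetical helper: collect the text of each matched tag."""
    return [tag.get_text() for tag in tags]


# Class selector: "pagination-container" is one of the element's classes.
by_class = texts(soup.select("p.pagination-container"))

# Attribute selector with "=": the complete class value must be equal.
by_exact = texts(soup.select("p[class='pagination-container']"))

# Attribute selector with "^=": the complete class value must start with "pag".
by_prefix = texts(soup.select("p[class^=pag]"))

print(by_class)   # ['slytherin', 'gryffindor', 'ravenclaw']
print(by_exact)   # ['slytherin']
print(by_prefix)  # ['slytherin', 'gryffindor']
```

The class selector is usually the one to reach for, since it does not depend on the order of classes or on which other classes are present.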


Handling class attribute values in BeautifulSoup is a common source of confusion and questions. See these related topics for more detail:

  • BeautifulSoup returns empty list when searching by compound class names
  • Finding multiple attributes within the span tag in Python
