Beautifulsoup how does findAll work
Problem description
I've noticed some weird behavior in the findAll method:
>>> from bs4 import BeautifulSoup
>>> htmls="<html><body><p class=\"pagination-container\">slytherin</p><p class=\"pagination-container and something\">gryffindor</p></body></html>"
>>> soup=BeautifulSoup(htmls, "html.parser")
>>> for i in soup.findAll("p",{"class":"pagination-container"}):
...     print(i.text)
...
slytherin
gryffindor
>>> for i in soup.findAll("p", {"class":"pag"}):
...     print(i.text)
...
>>> for i in soup.findAll("p",{"class":"pagination-container"}):
...     print(i.text)
...
slytherin
gryffindor
>>> for i in soup.findAll("p",{"class":"pagination"}):
...     print(i.text)
...
>>> len(soup.findAll("p",{"class":"pagination-container"}))
2
>>> len(soup.findAll("p",{"class":"pagination-containe"}))
0
>>> len(soup.findAll("p",{"class":"pagination-contai"}))
0
>>> len(soup.findAll("p",{"class":"pagination-container and something"}))
1
>>> len(soup.findAll("p",{"class":"pagination-conta"}))
0
So, when we search for pagination-container, it returns both the first and the second p tag. That made me think it looks for a partial match, something like if passed_string in class_attribute_value:. So I shortened the string passed to findAll, and it never found anything!
How is that possible?
Recommended answer
First of all, class is a special multi-valued, space-delimited attribute, and BeautifulSoup handles it specially.
When you write soup.findAll("p", {"class":"pag"}), BeautifulSoup searches for elements that have the class pag. It splits each element's class value on whitespace and checks whether pag is among the resulting items. An element with class="test pag" or class="pag" would be matched.
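A small check of the token-matching rule just described. The markup and variable names are made up for illustration; only BeautifulSoup and find_all come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two elements carry the class token "pag", one does not.
html = (
    '<p class="test pag">a</p>'
    '<p class="pag">b</p>'
    '<p class="pagination">c</p>'
)
soup = BeautifulSoup(html, "html.parser")

# "pag" is compared against each whitespace-separated class token,
# so "pagination" is not a match even though it contains "pag".
matched = [p.text for p in soup.find_all("p", {"class": "pag"})]
print(matched)
```

The third element is skipped, which is the same reason the shortened strings in the question found nothing.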
Note that in the case of soup.findAll("p", {"class": "pagination-container and something"}), BeautifulSoup matches an element whose class attribute has exactly that value. No splitting is involved here: it simply finds an element whose complete class value equals the requested string.
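The two rules (one class token, or the complete value) can be sketched in plain Python. This is a simplified model for intuition only, not BeautifulSoup's actual implementation; the function name class_filter_matches is hypothetical.

```python
def class_filter_matches(class_attr, wanted):
    """Simplified model of matching a plain string against a class attribute.

    Returns True if `wanted` equals one whitespace-separated class token,
    or equals the complete (space-joined) class value.
    """
    tokens = class_attr.split()
    return wanted in tokens or wanted == " ".join(tokens)


value = "pagination-container and something"

token_hit = class_filter_matches(value, "pagination-container")
full_hit = class_filter_matches(value, "pagination-container and something")
prefix_hit = class_filter_matches(value, "pagination-conta")
partial_hit = class_filter_matches(value, "and something")

print(token_hit, full_hit, prefix_hit, partial_hit)
```

Under this model a prefix of a token, or a fragment spanning only some of the tokens, matches neither rule, which is consistent with the counts shown in the question.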
To have a partial match on one of the classes, you can provide a regular expression or a function as a class filter value:
import re
soup.find_all("p", {"class": re.compile(r"pag")}) # contains pag
soup.find_all("p", {"class": re.compile(r"^pag")}) # starts with pag
soup.find_all("p", {"class": lambda class_: class_ and "pag" in class_}) # contains pag
soup.find_all("p", {"class": lambda class_: class_ and class_.startswith("pag")}) # starts with pag
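As a self-contained check of the regular-expression and function filters, here is a sketch run against the markup from the question plus one extra p without a class (added here as an assumption, to show why the lambda guards against None). The helper texts is hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<p class="pagination-container">slytherin</p>'
    '<p class="pagination-container and something">gryffindor</p>'
    '<p>no class</p>'
)
soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Hypothetical helper: collect the text of each matched tag."""
    return [t.text for t in tags]


# The pattern is searched in each class token, so a prefix is enough.
by_regex = texts(soup.find_all("p", {"class": re.compile(r"^pagination-cont")}))

# The function is called with each class token (and with None when the
# attribute is missing), so "something" satisfies startswith("some").
by_func = texts(
    soup.find_all("p", {"class": lambda c: c and c.startswith("some")})
)

print(by_regex)
print(by_func)
```

The regular expression should pick up both paginated paragraphs, while the function filter should pick up only the one that has a class token beginning with "some".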
There is much more to say, but you should also know that BeautifulSoup has CSS selector support (limited, but it covers most common use cases). You can write things like:
soup.select("p.pagination-container") # one of the classes is "pagination-container"
soup.select("p[class='pagination-container']") # match the COMPLETE class attribute value
soup.select("p[class^=pag]") # COMPLETE class attribute value starts with pag
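To see how these selectors differ, the sketch below applies the same three selectors to the markup from the question. The variable names are illustrative, and it assumes a BeautifulSoup version with select support installed.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="pagination-container">slytherin</p>'
    '<p class="pagination-container and something">gryffindor</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Class selector: matches when one of the classes is "pagination-container".
one_of = [p.text for p in soup.select("p.pagination-container")]

# Attribute selector with "=": the complete class value must be equal.
exact = [p.text for p in soup.select("p[class='pagination-container']")]

# Attribute selector with "^=": the complete class value must start with "pag".
prefix = [p.text for p in soup.select("p[class^=pag]")]

print(one_of)
print(exact)
print(prefix)
```

Only the exact-value selector separates the two paragraphs here; the other two match both.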
Handling class attribute values in BeautifulSoup is a common source of confusion and questions. See these related topics for more background:
- BeautifulSoup returns empty list when searching by compound class names
- Finding multiple attributes within the span tag in Python