BeautifulSoup web scraping find_all(): finding exact match
I'm using Python and BeautifulSoup for web scraping.
Let's say I have the following HTML code to scrape:
<body>
<div class="product">Product 1</div>
<div class="product">Product 2</div>
<div class="product special">Product 3</div>
<div class="product special">Product 4</div>
</body>
Using BeautifulSoup, I want to find ONLY the products with the attribute class="product" (only Product 1 and 2), not the 'special' products.
If I do the following:
result = soup.find_all('div', {'class': 'product'})
the result includes ALL the products (1,2,3, and 4).
What should I do to find products whose class EXACTLY matches 'product'?
The code I ran:
from bs4 import BeautifulSoup
import re
text = """
<body>
<div class="product">Product 1</div>
<div class="product">Product 2</div>
<div class="product special">Product 3</div>
<div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(text)
result = soup.findAll(attrs={'class': re.compile(r"^product$")})
print(result)
Output:
[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]
In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard.
As such, filtering on the class 'product' alone cannot exclude elements that carry additional classes.
You'll have to use a custom function here to match against the class instead:
result = soup.find_all(lambda tag: tag.name == 'div' and
tag.get('class') == ['product'])
I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product']; i.e. have just the one value.
Demo:
>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
... <div class="product">Product 1</div>
... <div class="product">Product 2</div>
... <div class="product special">Product 3</div>
... <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text)
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
For completeness' sake, here are all such set attributes, from the BeautifulSoup source code:
# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'. When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
"*" : ['class', 'accesskey', 'dropzone'],
"a" : ['rel', 'rev'],
"link" : ['rel', 'rev'],
"td" : ["headers"],
"th" : ["headers"],
"td" : ["headers"],
"form" : ["accept-charset"],
"object" : ["archive"],
# These are HTML5 specific, as are *.accesskey and *.dropzone above.
"area" : ["rel"],
"icon" : ["sizes"],
"iframe" : ["sandbox"],
"output" : ["for"],
}
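To see what that table means in practice, here is a minimal sketch that inspects a single tag. The markup is made up for illustration, and I'm assuming the built-in 'html.parser' backend; attributes listed in the table come back as lists, while an attribute that is not listed (id here) stays a plain string even if it contains a space.

```python
from bs4 import BeautifulSoup

# Made-up markup: class and rel are multi-valued attributes, id is not.
html = '<a id="main link" class="nav external" rel="nofollow noopener" href="#">x</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.a

print(link['class'])  # ['nav', 'external']
print(link['rel'])    # ['nofollow', 'noopener']
print(link['id'])     # main link
```

This is why tag.get('class') == ['product'] works as an exact test: the comparison is against the parsed list, not the raw attribute string.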