BeautifulSoup webscraping find_all( ): finding exact match


Problem description


I'm using Python and BeautifulSoup for web scraping.

Let's say I have the following HTML to scrape:

<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>

Using BeautifulSoup, I want to find ONLY the products whose class attribute is exactly "product" (Products 1 and 2), not the 'special' products.

If I do the following:

result = soup.find_all('div', {'class': 'product'})

the result includes ALL the products (1,2,3, and 4).

What should I do to find products whose class EXACTLY matches 'product'?


The code I ran:

from bs4 import BeautifulSoup
import re

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(text, "html.parser")
result = soup.find_all(attrs={'class': re.compile(r"^product$")})
print(result)

Output:

[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]

Solution

In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set of values: the attribute is parsed into a list of its space-separated values, and a search matches against the individual values listed in the attribute. This follows the HTML standard.

As such, a class filter on its own cannot restrict the search to tags that have just that one class.
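
To make the behaviour concrete, here is a minimal sketch (the two-element sample and the variable names are only illustrative) showing that each class attribute comes back as a list, which is why a filter on 'product' also matches the "product special" tag:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">Product 1</div>
<div class="product special">Product 3</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each class attribute is exposed as a list of its space-separated values.
class_lists = [div.get("class") for div in soup.find_all("div")]
print(class_lists)  # [['product'], ['product', 'special']]

# A plain class filter matches any tag whose list contains 'product'.
matched = soup.find_all("div", class_="product")
print(len(matched))  # 2
```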

You'll have to use a custom function here to match against the class instead:

result = soup.find_all(lambda tag: tag.name == 'div' and 
                                   tag.get('class') == ['product'])

I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product']; i.e., it must have just the one value.
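
If the same check is needed for several class combinations, the lambda can be wrapped in a small factory. The sketch below is one possible variation, not part of the original answer: has_exact_classes is a hypothetical helper name, and it compares sets rather than lists, so class="special product" and class="product special" are treated alike (an assumption about what "exact" should mean here).

```python
from bs4 import BeautifulSoup

def has_exact_classes(name, classes):
    """Return a find_all() filter: tag is `name` and its class set equals `classes`."""
    wanted = set(classes)

    def matcher(tag):
        # tag.get("class") is a list of class names, or None if the attribute is absent.
        return tag.name == name and set(tag.get("class") or []) == wanted

    return matcher

html = """
<div class="product">Product 1</div>
<div class="product special">Product 3</div>
<div class="special product">Product 5</div>
<div>No class</div>
"""
soup = BeautifulSoup(html, "html.parser")

only_product = [t.get_text() for t in soup.find_all(has_exact_classes("div", ["product"]))]
both = [t.get_text() for t in soup.find_all(has_exact_classes("div", ["product", "special"]))]
print(only_product)  # ['Product 1']
print(both)          # ['Product 3', 'Product 5']
```

Comparing lists, as the lambda above does, is stricter: it also requires the classes to appear in the same order as in the markup.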

Demo:

>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
...     <div class="product">Product 1</div>
...     <div class="product">Product 2</div>
...     <div class="product special">Product 3</div>
...     <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text, "html.parser")
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
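
As an alternative sketch, not taken from the original answer: in Beautiful Soup releases that ship CSS selector support through the soupsieve package (4.7 and later, as far as I know), an attribute selector compares the whole attribute value, so it can express the same exact match. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

soup = BeautifulSoup(text, "html.parser")

# [class="product"] compares the full attribute value, not individual class names.
exact = soup.select('div[class="product"]')
names = [div.get_text() for div in exact]
print(names)  # ['Product 1', 'Product 2']
```

By contrast, the class selector div.product would match all four elements, just like the find_all() call in the question.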

For completeness' sake, here are all the attributes treated this way, from the BeautifulSoup source code:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
    "*" : ['class', 'accesskey', 'dropzone'],
    "a" : ['rel', 'rev'],
    "link" :  ['rel', 'rev'],
    "td" : ["headers"],
    "th" : ["headers"],
    "td" : ["headers"],
    "form" : ["accept-charset"],
    "object" : ["archive"],

    # These are HTML5 specific, as are *.accesskey and *.dropzone above.
    "area" : ["rel"],
    "icon" : ["sizes"],
    "iframe" : ["sandbox"],
    "output" : ["for"],
    }
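
The list-parsing described in that comment can also be switched off. The sketch below assumes a Beautiful Soup version that accepts the documented multi_valued_attributes=None constructor argument (4.8 or later, to my knowledge); with it, class stays a plain string and an ordinary class filter becomes an exact string comparison. This is a side note rather than the approach the answer recommends, since it changes how every multi-valued attribute in the document is parsed.

```python
from bs4 import BeautifulSoup

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

# multi_valued_attributes=None disables splitting of class, rel, headers, etc.
soup = BeautifulSoup(text, "html.parser", multi_valued_attributes=None)

third = soup.find_all("div")[2]
print(repr(third["class"]))  # 'product special'

# The filter now compares against the whole string, so only exact matches remain.
plain = [div.get_text() for div in soup.find_all("div", class_="product")]
print(plain)  # ['Product 1', 'Product 2']
```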
