BeautifulSoup webscraping find_all( ): finding exact match

Views: 652 Published: 2016/8/5 18:54:50 python html regex web-scraping beautifulsoup


Problem description


I'm using Python and BeautifulSoup for web scraping.

Let's say I have the following HTML to scrape:

<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>

Using BeautifulSoup, I want to find ONLY the products with the attribute class="product" (only Product 1 and 2), not the 'special' products.

If I do the following:

result = soup.find_all('div', {'class': 'product'})

the result includes ALL the products (1,2,3, and 4).

What should I do to find products whose class EXACTLY matches 'product'?

The code I ran:

from bs4 import BeautifulSoup
import re

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(text, "html.parser")
result = soup.findAll(attrs={'class': re.compile(r"^product$")})
print(result)

Output:

[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]

Solution

In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set; you match against individual elements listed in the attribute. This follows the HTML standard.

As such, a plain class filter cannot limit the search to tags that carry only that one class.
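To make this concrete, the following minimal sketch inspects how the parsed tags store `class`. It assumes the `html.parser` backend from the standard library, and the variable names (`html`, `class_lists`, `matches`) are illustrative only:

```python
from bs4 import BeautifulSoup

html = """
<body>
    <div class="product">Product 1</div>
    <div class="product special">Product 3</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Beautiful Soup splits the class attribute on whitespace into a list.
class_lists = [div["class"] for div in soup.find_all("div")]
print(class_lists)

# A string filter on class matches when ANY single value equals it,
# so both divs are returned here.
matches = soup.find_all("div", {"class": "product"})
print(len(matches))
```

Because the second div's list also contains 'product', the filter accepts it even though it has an additional class.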

You'll have to use a custom function here to match against the class instead:

result = soup.find_all(lambda tag: tag.name == 'div' and 
                                   tag.get('class') == ['product'])

I used a lambda to create an anonymous function; each tag is matched on name (must be 'div'), and the class attribute must be exactly equal to the list ['product'], i.e. it must have just that one value.

Demo:

>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
...     <div class="product">Product 1</div>
...     <div class="product">Product 2</div>
...     <div class="product special">Product 3</div>
...     <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text, 'html.parser')
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
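If the same check is needed in several places, the lambda can be wrapped in a small reusable helper. This is only a sketch: `exact_class` is a hypothetical name, not part of Beautiful Soup, and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup


def exact_class(name, *classes):
    """Build a filter for tags named `name` whose class list equals `classes`."""
    wanted = list(classes)

    def matcher(tag):
        # tag.get("class") is None when the attribute is absent,
        # otherwise a list of the space-separated values.
        return tag.name == name and tag.get("class") == wanted

    return matcher


html = """
<div class="product">Product 1</div>
<div class="product special">Product 3</div>
<div>No class</div>
"""
soup = BeautifulSoup(html, "html.parser")

result = soup.find_all(exact_class("div", "product"))
texts = [tag.get_text() for tag in result]
print(texts)
```

Note that the list comparison is order-sensitive: `exact_class("div", "special", "product")` would not match `class="product special"`. Comparing `set(tag.get("class") or [])` against a set of wanted names would be one way to ignore order.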

For completeness' sake, here are all such set attributes, from the BeautifulSoup source code:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
    "*" : ['class', 'accesskey', 'dropzone'],
    "a" : ['rel', 'rev'],
    "link" :  ['rel', 'rev'],
    "td" : ["headers"],
    "th" : ["headers"],
    "td" : ["headers"],
    "form" : ["accept-charset"],
    "object" : ["archive"],

    # These are HTML5 specific, as are *.accesskey and *.dropzone above.
    "area" : ["rel"],
    "icon" : ["sizes"],
    "iframe" : ["sandbox"],
    "output" : ["for"],
    }
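The effect of this table can be observed on attributes other than `class`. The sketch below (assuming the `html.parser` backend; the markup and variable names are invented for illustration) compares `rel` on an `a` tag, which is listed above, with `id`, which is not:

```python
from bs4 import BeautifulSoup

html = '<a id="main link" rel="nofollow noopener" href="/x">x</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

rel_value = link["rel"]  # listed under "a": parsed into a list of values
id_value = link["id"]    # not a list attribute: kept as a single string
print(rel_value, id_value)
```

So the exact-match approach shown for `class` applies equally to `rel` and the other listed attributes, while unlisted attributes such as `id` can be compared as plain strings.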
