BeautifulSoup - search by text inside a tag

Problem description

Observe the following problem:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    Edit
</a>
""")

# This returns the <a> element
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

# This returns None
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)

For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces:

>>> a2 = soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update"
    )
>>> print(repr(a2.text))
'\n Edit\n'

Right. According to the docs, soup uses the regular expression's match function, not its search function, so I need to provide the DOTALL flag:

pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n')  # Returns None

pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n')  # Returns a match object
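
As a side check on the regex behaviour above, here is a minimal sketch that uses only the standard re module. It contrasts match(), which is anchored at the start of the string, with search(), which scans the whole string. The variable names are illustrative and not part of the original question.

```python
import re

text = "\n Edit\n"

# match() starts at position 0, and "." does not cross newlines by default.
print(re.compile(".*Edit.*").match(text))  # None

# With DOTALL, "." also matches "\n", so the anchored match succeeds.
print(re.compile(".*Edit.*", flags=re.DOTALL).match(text) is not None)  # True

# search() scans the string, so a bare pattern needs no ".*" and no flag.
found = re.compile("Edit").search(text)
print(found.group(0))  # Edit
```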

Alright. Looks good. Let's try it with soup:

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
)  # Still returns None... Why?!

Edit

My solution, based on geckon's answer: I implemented these helpers:

import re

MATCH_ALL = r'.*'


def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)


def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.

    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags are not strings, so convert each one before joining
        raise ValueError(
            "Too many matches:\n" + "\n".join(str(match) for match in matches)
        )
    elif len(matches) == 0:
        return None
    else:
        return matches[0]

Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update').

Solution

The problem is that your <a> tag, with the <i> tag inside it, doesn't have the string attribute you expect it to have. First, let's look at what the text="" argument of find() does.

NOTE: text is the old name of this argument; since BeautifulSoup 4.4.0 it is called string.

From the docs:

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is "Elsie":

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

Now let's take a look at what a Tag's string attribute is (from the docs again):

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

title_tag.string
# u'The Dormouse's story'

(...)

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

print(soup.html.string)
# None

This is exactly your case. Your <a> tag contains both a text node and an <i> tag, so its .string is None. When find() tests your pattern against .string, there is nothing to match, so the tag is skipped.
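
A minimal sketch of that difference, assuming bs4 is installed and using the built-in "html.parser" so no extra parser is needed. The names plain and mixed are illustrative only; the markup is condensed from the question.

```python
from bs4 import BeautifulSoup

HREF = "/customer-menu/1/accounts/1/update"

# <a> with a single text child
plain = BeautifulSoup('<a href="%s">Edit</a>' % HREF, "html.parser").a

# <a> with an <i> child followed by a text child
mixed = BeautifulSoup(
    '<a href="%s"><i class="fa fa-edit"></i> Edit</a>' % HREF,
    "html.parser",
).a

print(repr(plain.string))      # the only child is a string, so it is exposed
print(repr(mixed.string))      # two children, so .string is None
print(repr(mixed.get_text()))  # get_text() still joins all nested strings
```

This is why a filter on .string fails for the second markup, while anything based on get_text() or on the descendant strings still sees "Edit".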

How to solve this?

Maybe there is a better solution but I would probably go with something like this:

import re
from bs4 import BeautifulSoup as BS

soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
</a>
""")

links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

thelink = None  # stays None if no link contains the text
for link in links:
    if link.find(text=re.compile("Edit")):
        thelink = link
        break

print(thelink)

I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.
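
An alternative sketch, not part of the original answer: find() also accepts a function as its filter, so the href check and the text check can be combined in one pass. The helper name is_edit_link is hypothetical, and the example assumes bs4 with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

HREF = "/customer-menu/1/accounts/1/update"

soup = BeautifulSoup(
    '<a href="%s"><i class="fa fa-edit"></i> Edit</a>' % HREF,
    "html.parser",
)


def is_edit_link(tag):
    # Hypothetical helper: accept <a> tags with the expected href whose
    # combined text (including nested tags) contains "Edit".
    return (
        tag.name == "a"
        and tag.get("href") == HREF
        and "Edit" in tag.get_text()
    )


link = soup.find(is_edit_link)  # None if nothing matches
print(link)
```

Because get_text() concatenates every nested string, this does not depend on .string being set, at the cost of calling the function for each tag in the document.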
