BeautifulSoup - search by text inside a tag
Problem description
Observe the following problem:

    import re
    from bs4 import BeautifulSoup as BS

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
    Edit
    </a>
    """)

    # This returns the <a> element
    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*")
    )

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
    </a>
    """)

    # This returns None
    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*")
    )

For some reason, BeautifulSoup will not match the text when the <i> tag is there as well. Finding the tag and showing its text produces:

    >>> a2 = soup.find(
    ...     'a',
    ...     href="/customer-menu/1/accounts/1/update"
    ... )
    >>> print(repr(a2.text))
    '\n Edit\n'

Right. According to the docs, Beautiful Soup uses the regular expression's match function, not its search function, so I need to provide the DOTALL flag:

    pattern = re.compile('.*Edit.*')
    pattern.match('\n Edit\n')  # Returns None

    pattern = re.compile('.*Edit.*', flags=re.DOTALL)
    pattern.match('\n Edit\n')  # Returns MatchObject

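The difference between match and search can be checked with the re module alone, without Beautiful Soup. The following is a minimal sketch, not part of the original question; the variable names are illustrative, and it only shows how the two functions treat a leading newline:

```python
import re

text = "\n Edit\n"

# match() is anchored at position 0. By default "." does not match "\n",
# so the leading newline makes the pattern fail.
plain = re.compile(r".*Edit.*")
plain_result = plain.match(text)

# With DOTALL, "." also matches newlines, so match() succeeds.
dotall = re.compile(r".*Edit.*", flags=re.DOTALL)
dotall_result = dotall.match(text)

# search() scans the whole string, so neither the flag nor the
# ".*" padding is needed.
found = re.search(r"Edit", text)

print(plain_result)            # None
print(dotall_result is None)   # False
print(found.group(0))          # Edit
```

So the DOTALL flag does fix the pattern itself; whatever still goes wrong with soup.find() has to come from somewhere other than the regular expression.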
Alright. Looks good. Let's try it with soup:

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
    </a>
    """)

    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*", flags=re.DOTALL)
    )  # Still returns None... Why?!

Edit
My solution, based on geckon's answer: I implemented these helpers:

    import re

    MATCH_ALL = r'.*'


    def like(string):
        """
        Return a compiled regular expression that matches the given
        string with any prefix and postfix, e.g. if string = "hello",
        the returned regex matches r".*hello.*"
        """
        string_ = string
        if not isinstance(string_, str):
            string_ = str(string_)
        regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
        return re.compile(regex, flags=re.DOTALL)


    def find_by_text(soup, text, tag, **kwargs):
        """
        Find the tag in soup that matches all provided kwargs, and contains the
        text.

        If no match is found, return None.
        If more than one match is found, raise ValueError.
        """
        elements = soup.find_all(tag, **kwargs)
        matches = []
        for element in elements:
            if element.find(text=like(text)):
                matches.append(element)
        if len(matches) > 1:
            # Tag objects must be converted to str before they can be joined.
            raise ValueError("Too many matches:\n" + "\n".join(str(m) for m in matches))
        elif len(matches) == 0:
            return None
        else:
            return matches[0]

Now, when I want to find the element above, I just run:

    find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')

Solution
The problem is that your <a> tag with the <i> tag inside doesn't have the string attribute you expect it to have. First let's take a look at what the text="" argument for find() does.
NOTE: The text argument is an old name; since BeautifulSoup 4.4.0 it's called string.
From the docs:
Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the tags whose .string is "Elsie":

    soup.find_all("a", string="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

Now let's take a look at what Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

    title_tag.string
    # u'The Dormouse's story'

(...)
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

    print(soup.html.string)
    # None

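The quoted behaviour can be reproduced on markup shaped like the two snippets from the question. This is a minimal sketch, assuming bs4 is installed and using the built-in "html.parser"; the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

HREF = "/customer-menu/1/accounts/1/update"

# <a> with a single text child
plain = BeautifulSoup(f'<a href="{HREF}">Edit</a>', "html.parser")

# <a> with two children: an <i> tag and a text node
nested = BeautifulSoup(
    f'<a href="{HREF}"><i class="fa fa-edit"></i> Edit</a>', "html.parser"
)

plain_string = plain.a.string      # the only child is a NavigableString
nested_string = nested.a.string    # more than one child, so .string is None
nested_text = nested.a.get_text()  # concatenates the text of all descendants

print(repr(plain_string))
print(repr(nested_string))
print(repr(nested_text))
```

Note that get_text() (and the .text property used in the question) still returns the visible text, which is why printing a2.text looked fine even though .string had nothing to match against.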
This is exactly your case. Your <a> tag contains both text and an <i> tag. Therefore, find gets None when trying to search for a string, and thus it can't match.
How to solve this?
Maybe there is a better solution but I would probably go with something like this:

    import re
    from bs4 import BeautifulSoup as BS

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
    </a>
    """)

    links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

    # Start with None so the print below works even when nothing matches.
    thelink = None
    for link in links:
        # Search the text nodes inside each candidate link.
        if link.find(text=re.compile("Edit")):
            thelink = link
            break

    print(thelink)

I think there are not too many links pointing to /customer-menu/1/accounts/1/update, so it should be fast enough.
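Another option, not taken from the answer above, is to pass a function to find() and test the tag's combined text with get_text(). The sketch below assumes bs4 with the built-in "html.parser"; is_edit_link is a hypothetical helper name introduced only for this illustration:

```python
from bs4 import BeautifulSoup

HREF = "/customer-menu/1/accounts/1/update"
HTML = f"""
<a href="{HREF}">
<i class="fa fa-edit"></i> Edit
</a>
"""


def is_edit_link(tag):
    """Return True for <a> tags with the target href whose text contains 'Edit'."""
    return (
        tag.name == "a"
        and tag.get("href") == HREF
        and "Edit" in tag.get_text()
    )


soup = BeautifulSoup(HTML, "html.parser")

# find() calls the function once per tag and returns the first tag
# for which it returns True, or None if there is no such tag.
link = soup.find(is_edit_link)
print(link["href"])
```

Because get_text() joins the text of all descendants, this does not depend on .string and works whether or not the <i> tag is present.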