Find specific link text with bs4


Problem description

I am trying to scrape a website and find all the headings of a feed. I am having trouble getting just the text of the a tags that I need. Here is an example of the HTML:

<td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a>
<td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a>

I am trying to get the text of every a tag whose id starts with c and print each one on a new line.

My output should look like this:

TF4 - Oreos
Awesome Game Boy Facts

So far I have tried:

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
links = soup.find_all('a', {'id': 'c'})
for link in links:
    print(link.text)

But it doesn't find or print anything.

Solution

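A plain string such as 'c' only matches an id that is exactly "c", and the ids here are c1, c2 and so on. You can pass a compiled regular expression in place of the attribute value instead: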

links = soup.find_all('a', {'id': re.compile(r'^c\d+')})

^ anchors the match at the beginning of the string, and \d+ matches one or more digits.
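
To see what the pattern does on its own, here is a small sketch that uses only the re module. The ids b1, c1, x1 and c2 come from the sample HTML; "c" and "c1abc" are made up for contrast. Beautiful Soup applies a compiled pattern through its search() method, so the check below uses search() as well.

```python
import re

# Same pattern as in the answer: "c" at the start, then one or more digits.
pattern = re.compile(r'^c\d+')

# ids from the sample HTML, plus two hypothetical ones ("c", "c1abc").
sample_ids = ['b1', 'c1', 'x1', 'c2', 'c', 'c1abc']

# Keep only the ids the pattern finds a match in.
matching = [i for i in sample_ids if pattern.search(i)]
print(matching)  # ['c1', 'c2', 'c1abc']
```

The bare "c" is rejected because \d+ needs at least one digit. "c1abc" is accepted because nothing anchors the end of the pattern; if such ids should be excluded, r'^c\d+$' would be one way to tighten it.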

Demo:

>>> import re
>>> from bs4 import BeautifulSoup
>>> 
>>> html = """
... <tr>
...     <td class="m" id="b1"><a href="/QSYcfT" id="c1" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=BFNH-6K10Ic', 'QSYcfT', this.id); this.blur(); return false;">TF4 - Oreos</a> <a href="#" onClick="return lkP('1', 'QSYcfT');" id="x1"><font class="bp">(0)</font></a></td>
...     <td class="m" id="b2"><a href="/zXHNvp" id="c2" target="_blank" onClick="vPI('https://www.youtube.com/watch?v=0vjcGwZGBYI', 'zXHNvp', this.id); this.blur(); return false;">Awesome Game Boy Facts</a> <a href="#" onClick="return lkP('2', 'zXHNvp');" id="x2"><font class="bp">(0)</font></a></td>
... </tr>
... """
>>> soup = BeautifulSoup(html, 'html.parser')
>>> links = soup.find_all('a', {'id': re.compile(r'^c\d+')})
>>> for link in links:
...     print(link.text)
... 
TF4 - Oreos
Awesome Game Boy Facts
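
For reference, the same lookup can be written as a standalone Python 3 script. This is only a sketch: it assumes the bs4 package is installed, uses the built-in 'html.parser', and shortens the sample markup by dropping the onClick handlers and wrapping the row in a table. It also shows a CSS attribute-prefix selector as a possible alternative, which is not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Simplified sample markup (onClick handlers omitted, <table> wrapper added).
html = """
<table><tr>
  <td class="m" id="b1"><a href="/QSYcfT" id="c1">TF4 - Oreos</a> <a href="#" id="x1"><font class="bp">(0)</font></a></td>
  <td class="m" id="b2"><a href="/zXHNvp" id="c2">Awesome Game Boy Facts</a> <a href="#" id="x2"><font class="bp">(0)</font></a></td>
</tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Regex filter, as in the answer: id starts with "c" followed by digits.
by_regex = [a.text for a in soup.find_all('a', {'id': re.compile(r'^c\d+')})]

# Alternative: CSS selector for any <a> whose id starts with "c".
by_css = [a.text for a in soup.select('a[id^="c"]')]

for title in by_regex:
    print(title)
```

The CSS selector is looser than the regular expression: it accepts any id beginning with "c", digits or not. On this sample both approaches return the same two titles.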
