Python: Find specific link within HTML <a> tag


Question

In Python, I have a string containing the source code of a website. Within this source code, I want to get the link inside an <a> tag if the tag contains a specific substring.

The input looks like this, for example:

AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString

So what I want to tell Python is: search for SearchString in all <a> tags within the string and give me back the first link found, http://www.link-to-get.com.

This should only match if SearchString is inside the tag. It should also match if "SearchString" is part (a substring) of the link itself, http://www.link-to-get.com.

I have been searching for an answer for more than 30 minutes now, and the only thing I found for Python was how to extract every link (or only external, or only internal links) from a string.

Does anyone have an idea?

Thanks in advance!

Recommended answer

Using BeautifulSoup 3.2.1 with Python 2.7:

from BeautifulSoup import BeautifulSoup

search_string = 'SearchString'

website_source = '<a href="http://www.link-to-get.com">SearchString</a> <a href="http://www.link-to-get.com">OtherString</a>\
                  <a href="http://www.link-to-getSearchString.com">otherString</a>'

soup = BeautifulSoup(website_source)

# builds a list of [url, link text] pairs for every <a> whose href or text contains search_string
anchors = [[row['href'], row.text] for row in soup.findAll('a') if row['href'].find(search_string) != -1 or search_string in row.text]

# prints the whole list of matches
print anchors

# prints the first match as [url, link text]
print anchors[0]

# prints the url of the first match
print anchors[0][0]

The issue seems to be that I tested the above with BeautifulSoup 3.2.1, which only works on Python 2.x, whereas you are using Python 3.4, hence the error.
If you install BeautifulSoup4 and try the code below, it should work. Note also that BeautifulSoup4 works on both 2.x and 3.x.

Please note that the code below has not been tested.

from bs4 import BeautifulSoup

search_string = 'SearchString'

website_source = '<a href="http://www.link-to-get.com">SearchString</a> <a href="http://www.link-to-get.com">OtherString</a>\
                  <a href="http://www.link-to-getSearchString.com">otherString</a>'

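# name the parser explicitly so bs4 does not warn and behaves the same on every machine
soup = BeautifulSoup(website_source, 'html.parser')

# builds a list of [url, link text] pairs for every <a> whose href or text contains search_string;
# href=True skips anchors without an href, which would otherwise raise KeyError
anchors = [[row['href'], row.text] for row in soup.find_all('a', href=True) if search_string in row['href'] or search_string in row.text]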

# prints the whole list of matches
print(anchors)

# prints the first match as [url, link text]
print(anchors[0])

# prints the url of the first match
print(anchors[0][0])
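
For readers who cannot install BeautifulSoup, the same idea can be sketched with only the standard library. This is not the answer's method: it swaps in Python 3's html.parser module, and the names AnchorCollector and find_first_link are hypothetical helpers introduced here for illustration. It assumes anchors are not nested and returns only the first matching link, as the question asks.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collect (href, text) pairs for every <a> with an href."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs, in document order
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # only text inside an open <a href=...> is recorded
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.anchors.append((self._href, ''.join(self._text)))
            self._href = None


def find_first_link(html, search_string):
    """Return the first href whose URL or link text contains search_string, else None."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    for href, text in parser.anchors:
        if search_string in href or search_string in text:
            return href
    return None


source = 'AnyKindOfString <a href="http://www.link-to-get.com">SearchString</a> AndEvenMoreString'
print(find_first_link(source, 'SearchString'))
```

Text outside any <a> tag is ignored, so a SearchString that appears only in the surrounding page text does not produce a match; in that case the function returns None.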
