Retrieving a subset of href's from findall() in BeautifulSoup

Problem Description

My goal is to write a Python script that takes an artist's name as a string input, appends it to the base URL of the Genius search query, and then retrieves all the lyric links from the returned web page (the required subset for this problem, each link in which also contains the artist's name). I am in the initial phase right now and have only been able to retrieve all links from the web page, including the ones I don't want in my subset. I tried to find a simple solution but failed repeatedly.

import requests  # the Requests HTTP library
from bs4 import BeautifulSoup

user_input = input("Enter Artist Name = ").replace(" ", "+")
base_url = "https://genius.com/search?q=" + user_input

header = {'User-Agent': ''}
response = requests.get(base_url, headers=header)

# "lxml" names the parser backend; the lxml package must be installed.
soup = BeautifulSoup(response.content, "lxml")

for link in soup.find_all('a', href=True):
    print(link['href'])

This returns the complete list below, while I only need the ones that end with "lyrics" and contain the artist's name (here, for instance, Drake). These are the links from which I should be able to retrieve the lyrics.

https://genius.com/
/signup
/login
https://www.facebook.com/geniusdotcom/
https://twitter.com/Genius
https://www.instagram.com/genius/
https://www.youtube.com/user/RapGeniusVideo
https://genius.com/new
https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics
/search?page=2&q=drake
/search?page=3&q=drake
/search?page=4&q=drake
/search?page=5&q=drake
/search?page=6&q=drake
/search?page=7&q=drake
/search?page=8&q=drake
/search?page=9&q=drake
/search?page=672&q=drake
/search?page=673&q=drake
/search?page=2&q=drake
/embed_guide
/verified-artists
/contributor_guidelines
/about
/static/press
mailto:brands@genius.com
https://eventspace.genius.com/
/static/privacy_policy
/jobs
/developers
/static/terms
/static/copyright
/feedback/new
https://genius.com/Genius-how-genius-works-annotated
https://genius.com/Genius-how-genius-works-annotated

My next step would be to use Selenium to emulate scrolling, which in the case of genius.com loads the entire set of search results. Any suggestions or resources would be appreciated. I would also like a few comments on the way I plan to proceed with this solution. Can we make it more generic?
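(One lighter-weight alternative to Selenium, sketched here as an untested idea: the href dump above already shows /search?page=N&q=... pagination links, so successive result pages could be fetched with plain GET requests instead of scrolling. The helper names search_url and extract_hrefs are hypothetical, not part of any library.)

```python
from bs4 import BeautifulSoup

def search_url(query, page=1):
    # Follows the /search?page=N&q=... pagination links visible in the
    # href dump above; assumes they keep working for plain GET requests.
    return f"https://genius.com/search?page={page}&q={query}"

def extract_hrefs(html_text):
    # Pull every anchor href out of one page of HTML. Uses the stdlib
    # html.parser backend so no extra dependency is needed here.
    soup = BeautifulSoup(html_text, "html.parser")
    return [a['href'] for a in soup.find_all('a', href=True)]

print(search_url("drake", 2))  # https://genius.com/search?page=2&q=drake
```

Requesting page after page until no new lyric links appear would avoid a browser entirely, though the site's markup or pagination could change at any time, so treat this as a sketch rather than a finished solution.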

P.S. I may not have explained my problem very lucidly, but I have tried my best. Also, questions about any ambiguities are welcome. I am new to scraping, Python, and programming in general, so I just want to make sure I am following the right path.

Answer

Use Python's re module to match only the links you want.

import re
import requests
from bs4 import BeautifulSoup

user_input = input("Enter Artist Name = ").replace(" ", "+")
base_url = "https://genius.com/search?q=" + user_input

header = {'User-Agent': ''}
response = requests.get(base_url, headers=header)

soup = BeautifulSoup(response.content, "lxml")

# Match hrefs that end in "-lyrics".
pattern = re.compile(r"\S+-lyrics$")

for link in soup.find_all('a', href=True):
    if pattern.match(link['href']):
        print(link['href'])

Output:

https://genius.com/Drake-hotline-bling-lyrics
https://genius.com/Drake-one-dance-lyrics
https://genius.com/Drake-hold-on-were-going-home-lyrics
https://genius.com/Drake-know-yourself-lyrics
https://genius.com/Drake-back-to-back-lyrics
https://genius.com/Drake-all-me-lyrics
https://genius.com/Drake-0-to-100-the-catch-up-lyrics
https://genius.com/Drake-started-from-the-bottom-lyrics
https://genius.com/Drake-from-time-lyrics
https://genius.com/Drake-the-motto-lyrics

This just checks whether each link matches a pattern ending in -lyrics. You can apply similar logic to filter on the user_input variable as well.
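As a sketch of that follow-up idea, the pattern could incorporate the artist's name, assuming the /Artist-song-title-lyrics URL scheme shown in the output above; lyric_link_pattern is a hypothetical helper name:

```python
import re

def lyric_link_pattern(artist_name):
    # Build a regex matching lyric hrefs for one artist, assuming Genius
    # URLs of the form https://genius.com/Drake-hotline-bling-lyrics.
    slug = re.escape(artist_name.strip().replace(" ", "-"))
    return re.compile(r"\S*/" + slug + r"\S*-lyrics$", re.IGNORECASE)

pattern = lyric_link_pattern("Drake")
print(bool(pattern.match("https://genius.com/Drake-hotline-bling-lyrics")))  # True
print(bool(pattern.match("/search?page=2&q=drake")))                         # False
```

Combined with the loop above, this would filter down to the artist's lyric pages in one pass; case differences and featured-artist pages are edge cases a real script would still need to handle.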

Hope this helps.
