Getting links of Youtube search result

Problem description

I am trying to get the links of the videos that appear in the search results for a particular query on YouTube. I am using the BeautifulSoup and requests libraries in Python, and here is what I did:

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

base = "https://www.youtube.com/results?search_query="
query = "mickey+mouse"
r = requests.get(base + query)
page = r.text
soup = bs(page, 'html.parser')

# each result on the (classic) search page links to its video
# through an anchor with the class 'yt-uix-tile-link'
vids = soup.find_all('a', attrs={'class': 'yt-uix-tile-link'})

videolist = []
for v in vids:
    tmp = 'https://www.youtube.com' + v['href']
    videolist.append(tmp)

pd.DataFrame(videolist).to_excel(<PATH>, header=False, index=False)

This looks through the search results and saves the links of the first 20 videos (the ones that appear on a single page) to an Excel file. However, I wish to obtain, say, 400 or 500 links for the same query. How can I do so? I know how to get all the links from a particular channel, but how do I get the links for a particular search query?

Recommended answer

Someone on a different SE site created pretty much exactly what you're after, apart from the exporting to Excel: their script exports to CSV instead.

Unfortunately, SO doesn't let me paste possible duplicate answers from different SE sites.

#!/usr/bin/python
# http://docs.python-requests.org/en/latest/user/quickstart/
# http://www.crummy.com/software/BeautifulSoup/bs4/doc/

import csv
import re
import requests
import time
from bs4 import BeautifulSoup

# scrapes the title 
def getTitle():
    d = soup.find_all("h1", "branded-page-header-title")
    for i in d:
        name = i.text.strip().replace('\n', ' ').replace(',', '')
        f.write(name + ',')
        print(f'\t\t{name}')

# scrapes the subscriber and view count
def getStats():
    b = soup.find_all("li", "about-stat ") # trailing space is required.
    for i in b:
        value = i.b.text.strip().replace(',','')
        name = i.b.next_sibling.strip().replace(',','')
        f.write(value+',')
        print(f'\t\t{name} = {value}')

# scrapes the description
def getDescription():
    c = soup.find_all("div", "about-description")
    for i in c:
        description = i.text.strip().replace('\n', ' ').replace(',', '')
        f.write(description + ',')
        # print(f'\t\t{description}')

# scrapes all the external links 
def getLinks():
    a = soup.find_all("a", "about-channel-link ") # trailing space is required.
    for i in a:
        url = i.get('href')
        f.write(url+',')
        print(f'\t\t{url}')

# scrapes the related channels
def getRelated():
    s = soup.find_all("h3", "yt-lockup-title")
    for i in s:
        t = i.find_all(href=re.compile("user"))
        for i in t:
            url = 'https://www.youtube.com'+i.get('href')
            rCSV.write(url+'\n')
            print(f'\t\t{i.text}, {url}')  

f = open("youtube-scrape-data.csv", "w+")
rCSV = open("related-channels.csv", "w+")
visited = []
base = "https://www.youtube.com/results?search_query="
q = ['search+query+here']
page = "&page="
features = "html.parser"  # parser passed to BeautifulSoup below
count = 1
pagesToScrape = 20

for query in q:
    while count <= pagesToScrape:
        scrapeURL = base + str(query) + page + str(count)
        print(f'Scraping {scrapeURL} \n')
        r = requests.get(scrapeURL)
        soup = BeautifulSoup(r.text, features)
        users = soup.find_all("div", "yt-lockup-byline")
        for each in users:
            a = each.find_all(href=re.compile("user"))
            for i in a:
                url = 'https://www.youtube.com'+i.get('href')+'/about'
                if url in visited:
                    print(f'\t{url} has already been scraped\n\n')
                else:
                    r = requests.get(url)
                    soup = BeautifulSoup(r.text, features)
                    f.write(url+',')
                    print(f'\t{url}')
                    getTitle()
                    getStats()
                    getDescription()
                    getLinks()
                    getRelated()
                    f.write('\n')   
                    print('\n')
                    visited.append(url)
                    time.sleep(3)
        count += 1  
        time.sleep(3)
        print('\n')
    count = 1
    print('\n') 
f.close()
rCSV.close()
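
The script above writes CSV. To get the Excel file the question asks for, one extra pandas step (the same approach as the question's own snippet) is enough. A minimal sketch, assuming openpyxl is installed for .xlsx output and that the CSV rows parse cleanly; the .xlsx file name is just an example:

import pandas as pd

# read the scraped CSV (the script writes no header row) and re-save it as .xlsx
df = pd.read_csv("youtube-scrape-data.csv", header=None)
df.to_excel("youtube-scrape-data.xlsx", header=False, index=False)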

Source: https://codereview.stackexchange.com/questions/92001/youtube-search-results-scraper
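
The &page= parameter that the script relies on belongs to YouTube's old HTML interface and may stop working. As a more stable alternative, the official YouTube Data API v3 pages through search results with pageToken. A minimal sketch, not part of the original answer, assuming you have an API key from the Google Cloud Console (YOUR_API_KEY below is a placeholder):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder: create a key in the Google Cloud Console
SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"

def search_links(query, target=400):
    """Collect up to `target` video links for a search query via the Data API."""
    links = []
    page_token = None
    while len(links) < target:
        params = {
            "part": "snippet",
            "q": query,
            "type": "video",
            "maxResults": 50,  # the API maximum per request
            "key": API_KEY,
        }
        if page_token:
            params["pageToken"] = page_token
        data = requests.get(SEARCH_URL, params=params).json()
        for item in data.get("items", []):
            links.append("https://www.youtube.com/watch?v=" + item["id"]["videoId"])
        page_token = data.get("nextPageToken")
        if not page_token:  # no further result pages
            break
    return links[:target]

print(len(search_links("mickey mouse")))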
