BeautifulSoup Python YouTube scrape not working

Problem description

I'm trying to scrape YouTube video URLs and titles from YouTube accounts whose video pages are built as "https://www.youtube.com/c/%s/videos" % accountName, for example with the account name Apple.

When I click the title in inspector mode (Firefox), the selector for the clickable title text is ytd-grid-video-renderer #video-title.yt-simple-endpoint.ytd-grid-video-renderer.

I am not getting any results from that selector, although the URL (the 'url' key, somewhere in webCommandMetadata) and the title (the 'simpleText' key) do appear in req.content.

Example:

import requests
from bs4 import BeautifulSoup

account = "Apple"  # example account name from the question
url = "https://www.youtube.com/c/%s/videos" % account
req = requests.get(url, timeout=30)
soup = BeautifulSoup(req.content, 'html.parser')
# Earlier attempts with other selectors:
# latest_videos_html = soup.select('.yt-lockup-content:not(:has(span.yt-uix-livereminder)) .yt-lockup-title a')[:6]
# latest_videos_html = soup.select('.yt-lockup-content:not(:has(span.yt-uix-livereminder)) .yt-simple-endpoint a')[:18]
# Selector copied from the Firefox inspector; it matches nothing in the raw response:
latest_videos_html = soup.select('ytd-grid-video-renderer #video-title.yt-simple-endpoint.ytd-grid-video-renderer')[:18]

print(latest_videos_html)

My question is: how do I know what to pass to soup.select, and how do I debug this so that I can fix it myself in the future?

Thanks for your support!

Solution

The content you see in the browser is mostly loaded by JavaScript. A plain GET request does not return the dynamic content of the page.

Looking at user pages on YouTube, I can see that the response contains very little proper HTML; instead, the data arrives as JSON inside the body tag.

To answer your question: in the future, when you want to scrape something from a website, first make sure the content is actually present in what requests.get returns, rather than assuming that you get the same content a browser gets.

Now, specifically for the YouTube problem: if you save req.text to a file, open it in a text editor and look inside the <body> tag, you will see that in the second <script> tag the variable window["ytInitialData"] is set to a very long JSON object.
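One way to run that check is to inspect the raw response text before writing any selector. The sketch below is only an illustration, not part of the original answer: it uses the standard library so it can be run on any saved page, and the helper names TagCounter and inspect_response, as well as the sample HTML string standing in for req.text, are hypothetical.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags by name in a raw HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def inspect_response(html, tag_names, markers):
    """Report how often the given tags and marker strings occur in the raw text."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return {
        "tags": {name: counter.counts.get(name, 0) for name in tag_names},
        "markers": {marker: html.count(marker) for marker in markers},
    }


# Hypothetical stand-in for req.text: the data sits in a script, not in elements.
sample_html = (
    '<html><body><script>window["ytInitialData"] = '
    '{"title": {"simpleText": "Demo video"}};</script></body></html>'
)

report = inspect_response(
    sample_html,
    tag_names=["ytd-grid-video-renderer", "script"],
    markers=["simpleText"],
)
print(report)
```

If the element from the inspector has a count of zero while the text you want still shows up as a marker, the data is most likely embedded in a script, and no CSS selector will reach it.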

Inside it there is all the available info you need for every video (title, duration, video ID, etc.). I suggest you parse that JSON and see if it solves your problem.
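A minimal sketch of that parsing step follows, under some assumptions: the key names videoId and simpleText are guesses based on what shows up in the response, the helper names extract_initial_data and find_videos are hypothetical, and the sample string stands in for req.text. The layout of the real JSON may differ and can change without notice.

```python
import json


def extract_initial_data(html, marker="ytInitialData"):
    """Return the JSON object assigned after `marker`, or None if it is absent."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start)
    if brace == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it.
    data, _end = json.JSONDecoder().raw_decode(html, brace)
    return data


def find_videos(node):
    """Recursively yield (video_id, title) from dicts that carry both keys."""
    if isinstance(node, dict):
        title = node.get("title")
        # Assumed shape: {"videoId": "...", "title": {"simpleText": "..."}}
        if "videoId" in node and isinstance(title, dict) and "simpleText" in title:
            yield node["videoId"], title["simpleText"]
        for value in node.values():
            yield from find_videos(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_videos(item)


# Hypothetical stand-in for req.text.
sample_html = (
    '<body><script>window["ytInitialData"] = {"items": [{"gridVideoRenderer": '
    '{"videoId": "abc123", "title": {"simpleText": "Demo video"}}}]};'
    "</script></body>"
)

videos = list(find_videos(extract_initial_data(sample_html)))
for video_id, title in videos:
    print("https://www.youtube.com/watch?v=" + video_id, title)
```

Walking the whole structure recursively avoids hard-coding a long path of nested keys, which tends to break whenever the page layout changes.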
