Scraping YouTube comments in R


Question

I'm extracting user comments from a range of websites (such as reddit.com), and YouTube is another juicy source of information for me. My existing scraper is written in R:

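library(RCurl)  # getURL()
library(XML)    # htmlParse(), xpathSApply(), xmlValue()
# x is the URL of the page to scrape
html <- getURL(x)
doc  <- htmlParse(html, asText = TRUE)
# Collect every text node in <body>, skipping script, style and noscript content
txt  <- xpathSApply(doc,
   "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
   xmlValue)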

This doesn't work on YouTube data. In fact, if you look at the page source of a YouTube video, you'll find that the comments do not appear in it.
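One quick way to confirm this kind of problem is to count how many nodes in the fetched HTML match the selector you expect. The sketch below is only an illustration: `count_matches` is a hypothetical helper, the HTML string is a toy stand-in for a statically fetched page, and `.comment-text-content` is the comment selector used in the answer further down.

```r
library(rvest)

# Hypothetical helper: count the nodes in an HTML string that match a CSS selector.
count_matches <- function(html_string, css) {
  length(html_nodes(read_html(html_string), css))
}

# Toy stand-in for a statically fetched page: the comment container is empty
# because the comments would only be filled in later by JavaScript.
static_html <- '<html><body><div id="comments"></div><script>loadComments();</script></body></html>'

count_matches(static_html, ".comment-text-content")
```

If the count is zero for the raw download but the comments are visible in a browser, the content is being loaded dynamically and a plain HTTP fetch will not see it.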

Does anyone have any suggestions on how to extract data in such circumstances?

Many thanks!

Recommended answer

Following this answer: R: rvest: scraping a dynamic ecommerce page

You can do the following:

devtools::install_github("ropensci/RSelenium") # Install from github

library(RSelenium)
library(rvest)
pJS <- phantom(pjs_cmd = "PATH TO phantomjs.exe") # path to the PhantomJS binary (I am using Windows)
Sys.sleep(5) # give the binary a moment
remDr <- remoteDriver(browserName = 'phantomjs')
remDr$open()
remDr$navigate("https://www.youtube.com/watch?v=qRC4Vk6kisY")
remDr$getTitle()[[1]] # [1] "YouTube"

# scroll down
for(i in 1:5){      
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(3)    
}

# Get page source and parse it once via rvest
# (read_html() replaces html(), which is deprecated in current rvest)
page_source <- remDr$getPageSource()
page   <- read_html(page_source[[1]])
author <- page %>% html_nodes(".user-name") %>% html_text()
text   <- page %>% html_nodes(".comment-text-content") %>% html_text()

# Combine the data in a data.frame
dat <- data.frame(author = author, text = text, stringsAsFactors = FALSE)

Result:
> head(dat)
              author                                                                                       text
1 Kikyo bunny simpie Omg I love fluffy puff she's so adorable when she was dancing on a rainbow it's so cute!!!
2   Tatjana Celinska                                                                                     Ciao 0
3      Yvette Austin                                                                    GET OUT OF MY  HEAD!!!!
4           Susan II                                                                             Watch narhwals
5        Greg Ginger               who in the entire fandom never watched this, should be ashamed,\n\nPFFFTT!!!
6        Arnav Sinha                                                                 LOL what the hell is this?

Comment 1: You do need the GitHub version; see rselenium | get youtube page source

Comment 2: This code gives you the initial 44 comments. Some comments have a "show all answers" link that would have to be clicked. Also, to see even more comments you have to click the "show more" button at the bottom of the page. Clicking is explained in this excellent RSelenium tutorial: http://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-basics.html
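As a rough illustration of the clicking step, the sketch below keeps clicking a button until it is no longer on the page. It is not taken from the tutorial: `click_show_more` is a hypothetical helper, `remDr` is assumed to be the open remoteDriver from the code above, and the CSS selector in the usage line is a placeholder that you would need to replace after inspecting the live page.

```r
# Sketch: click a "show more" style button repeatedly until it disappears.
# `remDr` is assumed to be an open RSelenium remoteDriver.
# `css` is whatever selector identifies the button on the live page.
click_show_more <- function(remDr, css, max_clicks = 10, wait = 3) {
  clicks <- 0
  while (clicks < max_clicks) {
    # findElements() returns an empty list when nothing matches,
    # whereas findElement() would raise an error
    buttons <- remDr$findElements(using = "css selector", value = css)
    if (length(buttons) == 0) break
    buttons[[1]]$clickElement()
    clicks <- clicks + 1
    Sys.sleep(wait)  # give the page time to load the next batch of comments
  }
  clicks  # number of clicks actually performed
}

# Hypothetical usage; ".load-more-button" is an assumed selector, not a known one:
# n <- click_show_more(remDr, ".load-more-button")
```

The `max_clicks` cap is there so a button that never disappears cannot loop forever. After the clicks, fetch the page source again and re-run the rvest extraction.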
