Scraping linked HTML webpages by looping the rvest::follow_link() function
Question
How can I loop the rvest::follow_link() function to scrape linked webpages?
Use case:
- Identify all Lego Movie cast members
- Follow the link for each Lego Movie cast member
- Build a table of every movie (plus year) for all cast members
The selectors I need are below:
library(rvest)
# read_html() replaces the deprecated html()
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
lego_movie <- lego_movie %>%
  html_nodes(".itemprop , .character a") %>%
  html_text()
# follow cast links
(".itemprop .itemprop")
# grab tables of all movies and dates for each cast member
(".year_column , b a")
Desired output:
castMember       movie     year
Will Arnett      Lego      2017
Will Arnett      BoJack    2014
Will Arnett      Wander    2014
............
Elizabeth Banks  Moonbeam  2015
Elizabeth Banks  Wet Hot   2015
............
Alison Brie      Get Hard  2015
Alison Brie      GetaJob   2015
.....etc.....
Recommended answer
Something like this might work.
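library(rvest)
library(stringr)
library(data.table)

# Collect the cast names from the title page
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
cast

# Open a session so that links can be followed by their link text
# (rvest 1.0.0 and later name these session() and session_follow_link())
s <- html_session("http://www.imdb.com/title/tt1490017/")

cast_movies <- list()
for (i in cast[1:3]) {
  # Follow the link whose text matches the cast member's name
  actorpage <- s %>% follow_link(i) %>% read_html()
  # Keep the first 10 titles and the first 10 four-digit years
  cast_movies[[i]]$movies <- actorpage %>%
    html_nodes("b a") %>% html_text() %>% head(10)
  cast_movies[[i]]$years <- actorpage %>%
    html_nodes("#filmography .year_column") %>% html_text() %>%
    head(10) %>% str_extract("[0-9]{4}")
  cast_movies[[i]]$name <- rep(i, length(cast_movies[[i]]$years))
}
cast_movies

# Inspect one cast member, then stack all of them into a single table
as.data.frame(cast_movies[[1]])
rbindlist(cast_movies)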
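The loop above needs network access and depends on the page markup matching those selectors. As a minimal offline sketch of the same extract-and-stack pattern, the following runs on two hand-written stand-in pages built with rvest::minimal_html() (this assumes rvest 1.0.0 or later). The HTML fragments, the helper name extract_filmography, and the sample titles and years are illustrative assumptions, not real IMDb content.

```r
library(rvest)

# Hypothetical stand-ins for two actor pages. The markup only mimics the
# selectors used in the answer; it is not real IMDb HTML.
pages <- list(
  "Will Arnett" = minimal_html('
    <div id="filmography">
      <div><span class="year_column"> 2017 </span><b><a href="#">Lego</a></b></div>
      <div><span class="year_column">2014-2020</span><b><a href="#">BoJack</a></b></div>
    </div>'),
  "Alison Brie" = minimal_html('
    <div id="filmography">
      <div><span class="year_column">2015</span><b><a href="#">Get Hard</a></b></div>
    </div>')
)

# Hypothetical helper: turn one parsed page into a castMember/movie/year table.
extract_filmography <- function(page, name) {
  movies <- page %>% html_elements("#filmography b a") %>% html_text()
  years  <- page %>% html_elements("#filmography .year_column") %>% html_text()
  # Truncate to the shorter vector so the columns always have equal length
  n <- min(length(movies), length(years))
  data.frame(
    castMember = rep(name, n),
    movie      = movies[seq_len(n)],
    # Keep only the first four-digit run, e.g. "2014-2020" becomes 2014
    year       = as.integer(sub("^.*?([0-9]{4}).*$", "\\1",
                                years[seq_len(n)], perl = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Apply the helper to every page and stack the results into one table
filmography <- do.call(rbind, Map(extract_filmography, pages, names(pages)))
row.names(filmography) <- NULL
print(filmography)
```

Returning one data frame per cast member and binding them at the end avoids the length mismatch that can occur when titles and years are truncated separately. In the live version, the page argument would be the result of following each cast link and calling read_html().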