Scraping linked HTML webpages by looping the rvest::follow_link() function

Question

How can I loop the rvest::follow_link() function to scrape linked webpages?

Use case:

  1. Identify all Lego Movie cast members
  2. Follow the link for every Lego Movie cast member
  3. Build a table of every movie (plus year) for each cast member

The selectors I need are below:

library(rvest)
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
lego_movie <- lego_movie %>%
  html_nodes(".itemprop , .character a") %>%
  html_text()

# follow cast links
(".itemprop .itemprop") 

# grab tables of all movies and dates for each cast member
(".year_column , b a")
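
Selectors like these can be checked offline before looping over live pages. The sketch below runs a cast selector against a small inline HTML fragment and pulls both the link text (which follow_link() matches on) and the href. The fragment, its /name/example-N/ paths, and the variable names are invented for illustration and are not IMDb's actual markup.

```r
library(rvest)

# Hypothetical fragment for illustration only; NOT IMDb's real markup.
page <- read_html('
<div id="titleCast">
  <div class="itemprop">
    <a href="/name/example-1/"><span class="itemprop">Will Arnett</span></a>
  </div>
  <div class="itemprop">
    <a href="/name/example-2/"><span class="itemprop">Elizabeth Banks</span></a>
  </div>
</div>')

# Select the cast links once, then read the text and the target of each link
cast_links <- page %>% html_nodes("#titleCast .itemprop a")
cast_names <- cast_links %>% html_text(trim = TRUE)
cast_hrefs <- cast_links %>% html_attr("href")

print(data.frame(name = cast_names, href = cast_hrefs))
```

Seeing the names next to their hrefs makes it easier to confirm that a selector returns one node per cast member before any links are followed.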

Desired output:

castMember       movie    year
Will Arnett      Lego     2017
Will Arnett      BoJack   2014
Will Arnett      Wander   2014
        ............
Elizabeth Banks  Moonbeam 2015
Elizabeth Banks  Wet Hot  2015
        ............
Alison Brie      Get Hard 2015
Alison Brie      GetaJob  2015
        .....etc.....

Recommended answer

Something like this might work.

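library(rvest)
library(stringr)
library(data.table)

# Read the movie page and pull the cast names from the cast table
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
cast <- lego_movie %>%
    html_nodes("#titleCast .itemprop span") %>%
    html_text()
cast

# Open a browsing session so links can be followed by their link text
# (in rvest >= 1.0.0, html_session() is named session())
s <- html_session("http://www.imdb.com/title/tt1490017/")

cast_movies <- list()

# Follow the first three cast links; keep up to 10 titles and years for each
for(i in cast[1:3]){
    actorpage <- s %>% follow_link(i) %>% read_html()
    cast_movies[[i]]$movies <-  actorpage %>% 
        html_nodes("b a") %>% html_text() %>% head(10)
    cast_movies[[i]]$years <- actorpage %>%
        html_nodes("#filmography .year_column") %>% html_text() %>% 
        head(10) %>% str_extract("[0-9]{4}")
    cast_movies[[i]]$name <- rep(i, length(cast_movies[[i]]$years))
}

cast_movies                       # nested list, one entry per cast member
as.data.frame(cast_movies[[1]])   # a single cast member as a data frame
rbindlist(cast_movies)            # all cast members stacked into one table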
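
If the two selectors return different numbers of nodes for an actor, the movies and years columns can fall out of step. The base R sketch below shows one way to assemble the castMember / movie / year table while trimming each actor's columns to a common length. The helper to_rows() is a hypothetical name, and the data are placeholders copied from the question's desired output, not scraped results.

```r
# Placeholder data shaped like cast_movies above; the values come from the
# question's desired output and were NOT scraped.
cast_movies <- list(
  "Will Arnett"     = list(movies = c("Lego", "BoJack", "Wander"),
                           years  = c("2017", "2014", "2014")),
  "Elizabeth Banks" = list(movies = c("Moonbeam", "Wet Hot"),
                           years  = c("2015", "2015"))
)

# Hypothetical helper: turn one actor's entry into a data frame
to_rows <- function(name, entry) {
  # Use the shorter of the two vectors so the columns always line up
  n <- min(length(entry$movies), length(entry$years))
  data.frame(
    castMember = rep(name, n),
    movie      = entry$movies[seq_len(n)],
    year       = as.integer(entry$years[seq_len(n)]),
    stringsAsFactors = FALSE
  )
}

# Build one data frame per actor, then stack them
filmography <- do.call(rbind, Map(to_rows, names(cast_movies), cast_movies))
rownames(filmography) <- NULL
print(filmography)
```

Truncating to the shorter vector keeps every row complete, but it does not guarantee that a given title and year belong together; that still depends on both selectors walking the filmography in the same order.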
