Rvest: Scrape multiple URLs


Question

I am trying to scrape some IMDB data by looping through a list of URLs. Unfortunately, my output isn't quite what I hoped for, let alone stored in a data frame.

I get the URLs with

library(rvest)
topmovies <- read_html("http://www.imdb.com/chart/top")
links <- topmovies %>%
  html_nodes(".titleColumn") %>%
  html_nodes("a") %>%
  html_attr("href")
links_full <- paste("http://imdb.com",links,sep="")
links_full_test <- links_full[1:10]

Then I can get the titles with

lapply(links_full_test, . %>% read_html() %>% html_nodes("h1") %>% html_text())

but the result is a nested list, and I don't know how to get it into a proper data.frame in R. Similarly, if I want another attribute, say with

%>% read_html() %>% html_nodes("strong span") %>% html_text()

to retrieve the IMDB rating, I get the same nested-list output and, more importantly, I have to call read_html() twice for every URL, which takes a lot of time. Is there a better way to do this? I guess a for loop would work, but I can't get it working that way :(

Recommended answer

Here's one approach using purrr and rvest. The key idea is to download and parse each page once, save the parsed pages, and then extract the bits you're interested in from them.

library(rvest)
library(purrr)

topmovies <- read_html("http://www.imdb.com/chart/top")
links <- topmovies %>%
  html_nodes(".titleColumn") %>%
  html_nodes("a") %>%
  html_attr("href") %>% 
  xml2::url_absolute("http://imdb.com") %>% 
  .[1:5] # for testing

pages <- links %>% map(read_html)

title <- pages %>% 
  map_chr(. %>% 
    html_nodes("h1") %>% 
    html_text()
  )
rating <- pages %>% 
  map_dbl(. %>% 
    html_nodes("strong span") %>% 
    html_text() %>% 
    as.numeric()
  )
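
The question also asks how to end up with a data.frame. The sketch below is one possible way to do that once the pages are parsed, not part of the original answer. To keep it runnable offline, it parses two made-up HTML strings instead of real IMDB pages; those strings and the helper name extract_movie are assumptions for illustration. It also uses html_node() (singular) rather than html_nodes(), so that each page contributes exactly one title and one rating.

```r
library(rvest)
library(purrr)

# Hypothetical stand-ins for downloaded pages, so the sketch runs offline.
# With real data, `pages` would come from `links %>% map(read_html)`.
html <- c(
  "<html><body><h1>Movie A</h1><strong><span>9.2</span></strong></body></html>",
  "<html><body><h1>Movie B</h1><strong><span>8.7</span></strong></body></html>"
)
pages <- map(html, read_html)

# Hypothetical helper: pull both fields from one already-parsed page
# and return them as a one-row data.frame.
extract_movie <- function(page) {
  data.frame(
    title  = page %>% html_node("h1") %>% html_text(),
    rating = page %>% html_node("strong span") %>% html_text() %>% as.numeric(),
    stringsAsFactors = FALSE
  )
}

# Stack the one-row data.frames into a single data.frame.
movies <- do.call(rbind, lapply(pages, extract_movie))
```

Because every page is parsed only once and both fields are read from the same parsed object, this avoids the repeated read_html() calls mentioned in the question. The selectors are the ones used above and may need adjusting if the page layout differs.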
