R web scraping across multiple pages


Question

I am working on a web scraping program that searches for a specific wine variety and returns a list of local wines of that variety. The problem I am having is with results that span multiple pages. The code below is a basic example of what I am working with:

library(rvest)  # provides read_html(), html_nodes(), html_text()

url2 <- "http://www.winemag.com/?s=washington+merlot&search_type=reviews"
htmlpage2 <- read_html(url2)
names2 <- html_nodes(htmlpage2, ".review-listing .title")
Wines2 <- html_text(names2)

For this specific search there are 39 pages of results. I know the URL changes to http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=2 for the second page, but is there an easy way to make the code loop through all the returned pages and compile the results from all 39 pages into a single list? I know I could write out every URL by hand, but that seems like overkill.
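The URL pattern described above can be sketched in base R as follows. This is only an illustration: the variable names (base_url, query, n_pages, page_urls) are made up for the example, and the page count of 39 is taken from this particular search rather than detected from the site.

```r
# Illustrative names; the page count (39) comes from the search described above.
base_url <- "http://www.winemag.com/"
query    <- utils::URLencode("washington merlot", reserved = TRUE)  # percent-encodes the space
n_pages  <- 39

# paste0() is vectorised, so this yields one URL per page number.
page_urls <- paste0(base_url, "?s=", query, "&drink_type=wine&page=", seq_len(n_pages))

head(page_urls, 2)
```

paste0() is used here instead of sprintf() because the encoded query contains a literal "%", which would have to be doubled as "%%" inside a sprintf() format string.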

Recommended answer

If you want all the info as a data.frame, you can loop over the page numbers with purrr::map_df(), which builds one data.frame per page and row-binds them into a single result:

library(rvest)
library(purrr)

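# Use "+" for the space: a raw space is not valid in a URL, and "%20" would
# have to be written "%%20" inside a sprintf() format string.
url_base <- "http://www.winemag.com/?s=washington+merlot&drink_type=wine&page=%d"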

map_df(1:39, function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(wine=html_text(html_nodes(pg, ".review-listing .title")),
             excerpt=html_text(html_nodes(pg, "div.excerpt")),
             rating=gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation=html_text(html_nodes(pg, "span.appellation")),
             price=gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors=FALSE)

}) -> wines

dplyr::glimpse(wines)
## Observations: 1,170
## Variables: 5
## $ wine        (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
## $ excerpt     (chr) "Green olive, green stem and fresh herb aromas are at the ...
## $ rating      (chr) "96", "95", "94", "93", "93", "93", "93", "93", "93", "93"...
## $ appellation (chr) "Columbia Valley", "Columbia Valley", "Columbia Valley", "...
## $ price       (chr) "140", "70", "70", "20", "70", "40", "135", "50", "60", "3...
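Note that rating and price come back as character columns in the output above. A minimal sketch of converting them to numbers is shown below, assuming every value is a plain number once " Points" and "$" have been stripped. wines_sample is a hypothetical two-row stand-in for wines, built from the first values in the glimpse() output so the example runs without hitting the site.

```r
# A two-row stand-in for `wines`, using values from the glimpse() output above.
wines_sample <- data.frame(
  rating = c("96", "95"),
  price  = c("140", "70"),
  stringsAsFactors = FALSE
)

# Convert the scraped character columns to numeric types.
# Values that are not plain numbers would become NA with a warning.
wines_sample$rating <- as.integer(wines_sample$rating)
wines_sample$price  <- as.numeric(wines_sample$price)

str(wines_sample)
```

The same two assignments could be applied to the full wines data.frame; checking for NA values afterwards would show whether any page returned a rating or price in an unexpected format.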
