R中的Web Scraping与来自data.frame的循环 [英] Web Scraping in R with loop from data.frame

查看:243
本文介绍了R中的Web Scraping与来自data.frame的循环的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

library(rvest)

df <- data.frame(Links = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"))

for(i in 1:3) {
  webpage <- read_html(paste0("https://www.whatmobile.com.pk/", df$Links[i]))
  data <- webpage %>%
    html_nodes(".specs") %>%
    .[[1]] %>% 
    html_table(fill = TRUE)
}

想要使循环对于df$Links中的所有3个值均有效,但上述代码仅下载了最后一个,下载的数据还必须与变量相同(可能是带有变量名称的新列)

want to make loop works for all 3 values in df$Links but above code just download the last one, and downloaded data must also be identical with variables (may be a new column with variables name)

推荐答案

问题出在如何构造for循环.但是,首先不用一个就容易得多,因为R对遍历列表(例如lapplypurrr::map)提供了强大的支持.一种如何构造数据的版本:

The problem is in how you're structuring your for loop. It's much easier just to not use one in the first place, though, as R has great support for iterating over lists, like lapply and purrr::map. One version of how you could structure your data:

library(tidyverse)
library(rvest)

base_url <- "https://www.whatmobile.com.pk/"

models <- data_frame(model = c("Qmobile_Noir-M6", "Qmobile_Noir-A1", "Qmobile_Noir-E8"),
           link = paste0(base_url, model),
           page = map(link, read_html))

model_specs <- models %>% 
    mutate(node = map(page, html_node, '.specs'),
           specs = map(node, html_table, header = TRUE, fill = TRUE),
           specs = map(specs, set_names, c('var1', 'var2', 'val1', 'val2'))) %>% 
    select(model, specs) %>% 
    unnest()

model_specs
#> # A tibble: 119 x 5
#>              model      var1       var2
#>              <chr>     <chr>      <chr>
#>  1 Qmobile_Noir-M6     Build         OS
#>  2 Qmobile_Noir-M6     Build Dimensions
#>  3 Qmobile_Noir-M6     Build     Weight
#>  4 Qmobile_Noir-M6     Build        SIM
#>  5 Qmobile_Noir-M6     Build     Colors
#>  6 Qmobile_Noir-M6 Frequency    2G Band
#>  7 Qmobile_Noir-M6 Frequency    3G Band
#>  8 Qmobile_Noir-M6 Frequency    4G Band
#>  9 Qmobile_Noir-M6 Processor        CPU
#> 10 Qmobile_Noir-M6 Processor    Chipset
#> # ... with 109 more rows, and 2 more variables: val1 <chr>, val2 <chr>

数据仍然很混乱,但至少全部都在这里.

The data is still pretty messy, but at least it's all there.

这篇关于R中的Web Scraping与来自data.frame的循环的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆