Scraping data from multiple pages in R using rvest


Question


I am new to R and am trying to get data from Goodreads.com for a data analysis project. I need help with a script to get the book reviews along with the review dates, but this data is spread across multiple pages and many of the reviews are truncated. I need help getting this data, as I have to collect reviews for about 50 books. Thanks.

Answer


Well, you didn't post a specific URL, so I'll show you a couple of generic examples of how to iterate through several URLs and grab different kinds of data sets.

Example 1:

library(rvest)
library(stringr)

#create a master dataframe to store all of the results
complete <- data.frame()

yearsVector <- c("2010", "2011", "2012", "2013", "2014", "2015")
#position is not needed since all of the info is stored on the page
#positionVector <- c("qb", "rb", "wr", "te", "ol", "dl", "lb", "cb", "s")
positionVector <- c("qb")
for (i in 1:length(yearsVector)) {
    for (j in 1:length(positionVector)) {
        # create a url template 
        URL.base <- "http://www.nfl.com/draft/"
        URL.intermediate <- "/tracker?icampaign=draft-sub_nav_bar-drafteventpage-tracker#dt-tabs:dt-by-position/dt-by-position-input:"
        #create the dataframe with the dynamic values
        URL <- paste0(URL.base, yearsVector[i], URL.intermediate, positionVector[j])
        #print(URL)

        #read the page - coerce it to a character string so the stringr
        #functions below can search the raw HTML without a coercion warning
        page <- as.character(read_html(URL))

        #find the embedded JSON record for each player
        playersloc <- str_locate_all(page, "\\{\"personId.*?\\}")[[1]]
        #playersloc[, 1] is the match start and playersloc[, 2] the match end;
        #shift each by one to strip the surrounding braces
        players <- str_sub(page, playersloc[, 1] + 1, playersloc[, 2] - 1)
        #fix the cases where the players are named Jr.
        players <- gsub(", ", "_", players)

        #split and reshape the data in a data frame
        play2 <- strsplit(gsub("\"", "", players), ',')
        data <- sapply(strsplit(unlist(play2), ":"), FUN = function(x) { x[2] })
        df <- data.frame(matrix(data, ncol = 16, byrow = TRUE))
        #name the column names
        names(df) <- sapply(strsplit(unlist(play2[1]), ":"), FUN = function(x) { x[1] })


        #store the temp values into the master dataframe
        complete <- rbind(complete, df)
    }
}
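The same loop-over-URLs pattern applies to the asker's Goodreads case, where the reviews are paginated with a page parameter. A minimal sketch under assumptions: the `.reviewDate` and `.reviewText` selectors and the `?page=` parameter are hypothetical, so inspect the real book page to find the actual ones. Also note that rvest only sees the static HTML, so reviews that are truncated or loaded by JavaScript may require following each review's own URL.

```r
library(rvest)

#sketch only: the selectors and the ?page= parameter below are assumptions,
#not verified against the live Goodreads markup
get_reviews <- function(book_url, pages = 1:10) {
    review_list <- lapply(pages, function(p) {
        page <- read_html(paste0(book_url, "?page=", p))
        data.frame(
            date = html_text(html_nodes(page, ".reviewDate")),  #assumed selector
            text = html_text(html_nodes(page, ".reviewText")),  #assumed selector
            stringsAsFactors = FALSE
        )
    })
    #stack the per-page data frames into one
    do.call(rbind, review_list)
}

#usage (substitute a real Goodreads book URL for the placeholder):
#all_reviews <- get_reviews("https://www.goodreads.com/book/show/BOOK_ID")
```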

Example 2:

jump <- seq(0, 800, by = 100)
site <- paste('http://www.basketball-reference.com/play-index/draft_finder.cgi?',
              'request=1&year_min=2001&year_max=2014&round_min=&round_max=',
              '&pick_overall_min=&pick_overall_max=&franch_id=&college_id=0',
              '&is_active=&is_hof=&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y',
              '&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&c1stat=&c1comp=&c1val=&c2stat=&c2comp=',
              '&c2val=&c3stat=&c3comp=&c3val=&c4stat=&c4comp=&c4val=&order_by=year_id',
              '&order_by_asc=&offset=', jump, sep="")

dfList <- lapply(site, function(i) {
    webpage <- read_html(i)
    draft_table <- html_nodes(webpage, 'table')
    #the last expression is the value returned for each page
    html_table(draft_table)[[1]]
})

finaldf <- do.call(rbind, dfList)
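With roughly 50 books, each spanning several pages, a single failed request would abort the whole `lapply` run. One way to harden the loop is to wrap each request in `tryCatch` and pause between requests to be polite to the server; this is a sketch, and `safe_scrape` is a helper name introduced here, not part of the original answer.

```r
library(rvest)

#wrap each request so one failing page does not abort the loop,
#and throttle so the server is not hammered
safe_scrape <- function(url) {
    tryCatch({
        Sys.sleep(1)                      #pause between requests
        webpage <- read_html(url)
        html_table(html_nodes(webpage, "table"))[[1]]
    }, error = function(e) {
        message("Failed: ", url)
        NULL                              #mark this page as skipped
    })
}

#usage with the `site` vector from Example 2; drop the failed pages
#before binding:
#dfList <- lapply(site, safe_scrape)
#finaldf <- do.call(rbind, Filter(Negate(is.null), dfList))
```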
