Webscraping content across multiple pages using rvest package


Problem Description


I am a very novice R programmer, but I have been attempting to do some webscraping off of the website of an online university using the rvest package. The first table of information I scraped from the webpage was a listing of all of the doctoral level programs offered. Here is my code:

library(xml2)
library(httr)
library(rvest)
library(selectr)

Scraping Capella Doctoral

fileUrl <- read_html("http://www.capella.edu/online-phd-programs/")

Using the SelectorGadget tool in Chrome, I was able to select the content on the site I wanted to extract. In this case, I am selecting all doctoral level programs.

Degrees <- fileUrl %>%
html_nodes(".accordianparsys a") %>%
html_text() 
Degrees
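
Side note: those same nodes also carry the link targets in their href attributes, so the program URLs can be pulled straight off the page instead of being rebuilt from the program names later. A minimal sketch, assuming the hrefs are site-relative paths (the variable name Links is my own):

# Grab the href attribute from the same nodes and prepend the domain
Links <- fileUrl %>%
    html_nodes(".accordianparsys a") %>%
    html_attr("href") %>%
    paste0("http://www.capella.edu", .)
Links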

Next, I created a data frame of the doctoral level degrees.

Capella_Doctoral = data.frame(Degrees)       

Below I am creating another column that flags these programs as coming from Capella.

Capella_Doctoral$SchoolFlag <- "Capella" 
View(Capella_Doctoral)

Everything seems to work great in my code above. However, the next type of information I would like to scrape is tuition cost and credit hours per doctoral program. This information exists on each individual doctoral program's page. For example, the PhD in Leadership program will contain the tuition and credit hour information on this page "http://www.capella.edu/online-degrees/phd-leadership/". The DBA in Accounting program will contain tuition and credit hour information on this page "http://www.capella.edu/online-degrees/dba-accounting/". The common theme among the various pages is that each URL includes the name of the program after "online-degrees/".

In order to create a list of the various web pages I need (those that include the doctoral program names), I developed the code below.

Formatting the doctoral degrees into lowercase, removing any leading and trailing whitespace, and then replacing any spaces with dashes

Lowercase <- tolower(Capella_Doctoral$Degrees) 
Lowercase

Removing leading and trailing whitespace

trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Trim <- trim(Lowercase)
Trim

replacing spaces with dashes

Dashes <- gsub(" ", "-", Trim)
Dashes
Dashes2 <- gsub("---", "-", Dashes)
Dashes2
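
For what it's worth, the lowercasing, trimming, and dash substitution can be collapsed into a single pass: base R's trimws() (available since R 3.2.0) handles the trimming, and one regex replaces any run of spaces and dashes with a single dash. A sketch that should be equivalent to the three steps above:

# Lowercase, trim, and collapse runs of spaces/dashes into one dash
Dashes2 <- gsub("[-[:space:]]+", "-", trimws(tolower(Capella_Doctoral$Degrees)))
Dashes2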

Next, I add the reformatted doctoral degrees to the end of the base URL below to get a listing of all of the possible URLs I need to scrape tuition and credit hour information from for each program.

library(data.table)  # rbindlist() comes from data.table, which wasn't loaded above

urls <- rbindlist(sapply(Dashes2, function(x) {
    url <- paste("http://www.capella.edu/online-degrees/",x,"/", sep="")
    data.frame(url)
}), fill=TRUE)
Spec_URLs <- data.frame(urls)
View(Spec_URLs)
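
As an aside, since paste0() is vectorized over Dashes2, the same table can be built without any row-binding at all; a one-line sketch:

# paste0() recycles the fixed pieces across all of Dashes2
Spec_URLs <- data.frame(url = paste0("http://www.capella.edu/online-degrees/", Dashes2, "/"))
View(Spec_URLs)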

Now that I have a listing of all of the URLs I need to scrape information from, I need to know how I can use the below function for each of the URLs. The code below is only extracting tuition and credit hour info for one of the URLs. How do I get it to loop through all of the URLs? My end goal is to get a table of all of the tuition and credit hour information for each doctoral program into a data frame.

fileUrl <- read_html("http://www.capella.edu/online-degrees/phd-leadership/")

Tuition <- fileUrl %>%
   html_nodes("p:nth-child(4) strong , .tooltip~ strong") %>%
   html_text() 
Tuition

Results:

Tuition
[1] "120 Credits"        "$4,665 per quarter"
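
One way to run that same extraction over every page is to wrap it in a small function and lapply() over the URL column. The sketch below assumes the same CSS selector matches on every program page; the function name get_program_info is my own.

# Hypothetical helper: pull the credit/tuition strings from one program page
get_program_info <- function(u) {
    info <- read_html(u) %>%
        html_nodes("p:nth-child(4) strong , .tooltip~ strong") %>%
        html_text()
    # Indexing past the end of a character vector yields NA, so short
    # results from odd page layouts won't error out
    data.frame(url = u, credits = info[1], tuition = info[2],
               stringsAsFactors = FALSE)
}

# Loop over all program URLs and stack the results into one data frame
Tuition_Table <- do.call(rbind, lapply(as.character(Spec_URLs$url), get_program_info))
View(Tuition_Table)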

Solution

This is quick and dirty... and I'm hoping it won't create more questions than answers. Essentially this function grabs all of the individual URLs linking to departments, and then performs the same series of operations on each, returning one aggregated data object. In our case, a data frame with 82 rows. If you wanted to clean this up, you could reformat the columns and tidy up the NAs a bit. Hope it works for you.

library(rvest)
library(stringi)
library(htmltools)
library(plyr)
library(dplyr)
library(DT)


# This is a helper function I threw on top..
txt.safe_text <- function(x){
  str_in <- iconv(x, "latin1", "ASCII", sub="")  %>%  stri_enc_toutf8()
  str_in %>%
    stri_replace_all_fixed('<U+0080><U+0093>',"'\\-'") %>%
    stri_enc_toascii %>% htmlEscape %>%
    stri_unescape_unicode %>%
    stri_replace_all_regex("\\032\\032\\032","-")%>%
    stri_replace_all_regex("\n","")
}
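
A quick illustration of what the helper does on a made-up string (the mis-encoded dash handling only kicks in on text scraped from the live pages):

txt.safe_text("PhD in Leadership\n120 Credits")
# [1] "PhD in Leadership120 Credits"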




# Here's the iterator. I gave it zero args for purposes of the concept, but you
# could add variable urls or filtering functions

parse.apella <- function(){


  # html() was deprecated but I use the older version of rvest so set the new name
  # to an alias for reproduction.
  read_html <- html


  # This is our index table. We are going to use this as a key to then query all
  # other site info but keep a backref to the school variable and url
  idx_df <-
    lapply(read_html("http://www.capella.edu/online-phd-programs/") %>%
             html_nodes(".accordianparsys a"),function(i)
               data.frame(focus = html_text(i),
                          link = paste0("http://www.capella.edu", html_attr(i,"href"))
                          )) %>% rbind.pages

  # Threw this in for use case later with rendering a datatable and then being able to
  # jump straight to the site you are referencing.

  idx_df$html_output <- sapply(1:nrow(idx_df),function(i)
    htmltools::HTML(paste0(sprintf('<a href="%s">%s</a>',idx_df[i,2],idx_df[i,1]))))


  # Ok...so... for every index in our idx_df table above we are going to:
  # read site > parse the p html tags > pass a text cleaning function >
  # replace the leftovers eg:'\t' > split the string on the new line '\n'
  # character for easier use in building a data frame later > filter out all
  # returned data that has a character length of less than or equal to 2 >
  # create a data frame with a filtering column in our loop.

  # Note: this is going to get the data for, I think, 84 websites, so give it a second
  # to run.

  A <- llply(1:nrow(idx_df),function(ii)
    lapply(read_html(idx_df[[2]][[ii]]) %>%
             html_nodes(".gernic_large_text > p") %>%
             html_text %>% txt.safe_text %>%
             stri_replace_all_regex("\t","\n") %>%
             strsplit("\n"),function(i)
               stri_split_regex(i,"  ") %>% unlist %>%
             data.frame(raw_txt = .) %>% filter(nchar(raw_txt)>2) %>%
             mutate(df_idx = 1:length(raw_txt),
                    school_name = idx_df[[1]][[ii]],
                    html_link = idx_df[[3]][[ii]])
    )
  )


  # Above we built a list of data frames...and the rule we know is that any information
  # we are interested in would produce at least two rows of data, as we split
  # our raw html on the new line character. This means any data frame in our list
  # with 1 row is unimportant, but it was easier to filter out than to parse out earlier.
  # So we remove all those data frames with only 1 row.
  CC <- lapply(1:length(A),function(i)A[[i]][mapply(nrow,A[[i]]) == 2] %>% rbind.pages)


  # Helper function for looping through. I shouldn't have used numbers for the column names,
  # but I'm just slapping this together.
  # This is going to essentially go through our data frames and transpose the structure
  # so that our final product is a wide data structure rather than a long.

  trans_df <- function(df_in = NULL,i){
    tmp_d <-
      as.data.frame(
        t(c(df_in[[i]][df_in[[i]][[2]] == 2,4][[1]],
            df_in[[i]][df_in[[i]][[2]] == 2,3][[1]],
            df_in[[i]][df_in[[i]][[2]] == 2,1]))
      )

    colnames(tmp_d) <-  c('html_link','school name',df_in[[i]][df_in[[i]][[2]] == 1,1])
    tmp_d
  }


  #  For every index in our list we're going to transpose our structures
  # And do some text cleaning and splitting
  all_dat <- ldply(1:length(CC),function(i)trans_df(df_in = CC,i)) %>%
    mutate(short_name = stri_extract_first_words(`school name`),
           Cost =
             ifelse(!is.na(Cost),
                    stri_extract_first_words(Cost),
                    'Not Listed')
           ) %>% mutate(program =
                   stri_replace_all_regex(
                     `school name`,
                     paste0('(',short_name,'| - )'),"") %>%
                   stri_trim_both) %>%
    mutate(next_session = as.Date(strptime(`Next Start Date`,"%b. %d,%Y"))) %>%
    mutate(Cost = as.numeric(gsub(",","",Cost))) %>% 
  select(html_link,
         short_name,
         program,
         cost = Cost,
         credit_hours = `Transfer Credits`,
         next_session,
         total_credits = `Total Quarter Credits`,
         session_length = `Course Length`)

  # Quick thing I noticed on the credit hours. Loop back over and
  # grab only the numeric values
  all_dat$credit_hours <-
    lapply(all_dat$credit_hours,function(i)
      stri_extract_all_regex(i,"[[:digit:]]") %>%
        unlist %>% paste0(collapse = "") %>% as.numeric) %>%
    unlist


  # Should be done
  return(all_dat)
}
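
One caveat: parse.apella() fires off roughly 84 sequential requests, and a single failed read aborts the whole run. A wrapper along these lines (my addition, not part of the answer's code) could stand in for the read_html calls so that bad pages are skipped instead:

# Hypothetical safety wrapper: returns NULL instead of erroring on a bad page
safe_read <- function(u) {
    tryCatch(read_html(u), error = function(e) NULL)
}

Any NULL results would then need to be dropped before the html_nodes() calls.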



rock.apella <- parse.apella()

str(rock.apella)
# 'data.frame':  82 obs. of  8 variables:
# $ html_link     : chr  "<a href=\"http://www.capella.edu/online-degrees/phd-leadership\">PHD - Leadership </a>"| __truncated__ ...
# $ short_name    : chr  "PHD" "PHD" "PHD" "PHD" ...
# $ program       : chr  "Leadership" "Information Technology Education" "General Information Technology" "Information Assurance and Security" ...
# $ cost          : num  4665 4665 4665 4665 4665 ...
# $ credit_hours  : num  32 32 48 32 32 32 32 32 48 32 ...
# $ next_session  : Date, format: "2016-04-11" "2016-04-11" "2016-04-11" "2016-04-11" ...
# $ total_credits : chr  "120 Credits" "120 Credits" "120 Credits" "120 Credits" ...
# $ session_length: chr  "10 weeks" "10 weeks" "10 weeks" "10 weeks" ...

DT::datatable(rock.apella,escape = F, options = list(searchHighlight = TRUE), filter = 'top')

Here's our final output, rendered as a datatable; you can see the same output in jsfiddle: https://jsfiddle.net/cbfas/0x37vudv/1/
