R function is looping over the same data in webscraper


Problem Description

Here is the program I wrote:

    library(rvest)
    library(RCurl)
    library(XML)
    library(stringr)


    #Getting the number of Page
    getPageNumber <- function(URL){
      parsedDocument = read_html(URL)
      Sort1 <- html_nodes(parsedDocument, 'div')
      Sort2 <- Sort1[which(html_attr(Sort1, "class") == "pageNumbers al-pageNumbers")] 
      P <- str_count(html_text(Sort2), pattern = " \\d+\r\n")
      return(ifelse(length(P) == 0, 0, max(P)))
    }


    #Getting all articles based off of their DOI
    getAllArticles <-function(URL){
      parsedDocument = read_html(URL)
      Sort1 <- html_nodes(parsedDocument,'div')
      Sort2 <-  Sort1[which(html_attr(Sort1, "class") == "al-citation-list")]
      ArticleDOInumber = trimws(gsub(".*10.1093/dnares/","",html_text(Sort2)))
      URL3 <- "https://doi.org/10.1093/dnares/"
      URL4 <- paste(URL3, ArticleDOInumber, sep = "")
      return(URL4)
    }


    Title <- function(parsedDocument){
      Sort1 <- html_nodes(parsedDocument, 'h1')
      Title <- gsub("<h1>\\n|\\n</h1>","",Sort1)
      return(Title)
    }


    #main function with input as parameter year
    findURL <- function(year_chosen){
      if(year_chosen >= 1994){
      noYearURL = glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
      pagesURl = "&fl_SiteID=5275&startpage="
      URL = paste(noYearURL, pagesURl, sep = "")
      #URL is working with parameter year_chosen
      Page <- getPageNumber(URL)
      

      Page2 <- 0
      while(Page < Page2 | Page != Page2){
        Page <- Page2
        URL3 <- paste(URL, Page-1, sep = "")
        Page2 <- getPageNumber(URL3)    
      }
      R_Data <- data.frame()
      for(i in 1:Page){ #0:Page-1
        URL2 <- getAllArticles(paste(URL, i, sep = ""))
        for(j in 1:(length(URL2))){
          parsedDocument <- read_html(URL2[j])
          print(URL2[j])
          R <- data.frame("Title" = Title(parsedDocument),stringsAsFactors = FALSE)
          #R <- data.frame("Title" = Title(parsedDocument), stringsAsFactors = FALSE)
          R_Data <- rbind(R_Data, R)
        } 
      }
      paste(URL2)
      suppressWarnings(write.csv(R_Data, "DNAresearch.csv", row.names = FALSE, sep = "\t"))
      #return(R_Data)
      } else {
        print("The Year you provide is out of range, this journal only contain articles from 2005 to present")
      }
    }

    findURL(2003)

The output of my code is as follows:

[1] "https://doi.org/10.1093/dnares/10.6.249"
[1] "https://doi.org/10.1093/dnares/10.6.263"
[1] "https://doi.org/10.1093/dnares/10.6.277"
[1] "https://doi.org/10.1093/dnares/10.6.229"
[1] "https://doi.org/10.1093/dnares/10.6.239"
[1] "https://doi.org/10.1093/dnares/10.6.287"
[1] "https://doi.org/10.1093/dnares/10.5.221"
[1] "https://doi.org/10.1093/dnares/10.5.203"
[1] "https://doi.org/10.1093/dnares/10.5.213"
[1] "https://doi.org/10.1093/dnares/10.4.137"
[1] "https://doi.org/10.1093/dnares/10.4.147"
[1] "https://doi.org/10.1093/dnares/10.4.167"
[1] "https://doi.org/10.1093/dnares/10.4.181"
[1] "https://doi.org/10.1093/dnares/10.4.155"
[1] "https://doi.org/10.1093/dnares/10.3.115"
[1] "https://doi.org/10.1093/dnares/10.3.85"
[1] "https://doi.org/10.1093/dnares/10.3.123"
[1] "https://doi.org/10.1093/dnares/10.3.129"
[1] "https://doi.org/10.1093/dnares/10.3.97"
[1] "https://doi.org/10.1093/dnares/10.2.59"
[1] "https://doi.org/10.1093/dnares/10.6.249"
[1] "https://doi.org/10.1093/dnares/10.6.263"


I'm trying to scrape a journal with years as a parameter. I've scraped one page, but when I'm supposed to change pages my loop just goes back to the top of the page and loops over the same data. My code should be right and I don't understand why this is happening. Thank you in advance

Answer


It is not that it is reading the same url. It is that you are selecting for the wrong node which happens to yield repeating info. As I mentioned in your last question, you need to re-work your Title function. The Title re-write below will extract the actual article title based on class name and single node match.
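To illustrate why the original node selection repeats info: calling gsub() on a whole node set coerces each node to its full HTML string, and a pattern written for a bare `<h1>` tag fails once the tag carries attributes. A minimal, network-free sketch (the markup below is hypothetical, standing in for the journal page):

```r
library(rvest)

# Hypothetical markup standing in for an article page
doc <- read_html('<div class="widget"><h1 class="article-title-main">
  My Title
</h1></div>')

# Old approach: gsub() coerces the node to its full HTML string, and the
# pattern "<h1>\\n" never matches because the real tag has a class attribute
old <- gsub("<h1>\\n|\\n</h1>", "", html_nodes(doc, "h1"))

# Rewritten approach: single node match by class, then text extraction
title <- doc %>%
  html_node(".article-title-main") %>%
  html_text() %>%
  trimws()
```

Here `old` still contains raw markup, while `title` is just the clean article title.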


Please note the removal of your sep arg. There are also some other areas of the code that look like they probably could be simplified in terms of logic.
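As a hypothetical sketch of one such simplification (the placeholder URL and page count below are made up): the page URLs can be built in one vectorized paste0() call instead of being pasted inside the loop, and per-page data frames can be collected in a list and bound once rather than with rbind() on every iteration:

```r
base <- "https://example.com/search-results?page="  # placeholder base URL
Page <- 3                                           # assumed page count

# Vectorized paste: one URL per results page, no loop needed
pageURLs <- paste0(base, seq_len(Page))

# Collect per-page results as a list of data frames, then bind once;
# this avoids the quadratic cost of rbind() inside the loop
frames <- lapply(pageURLs, function(u) data.frame(Title = u, stringsAsFactors = FALSE))
R_Data <- do.call(rbind, frames)
```

Binding once at the end also keeps column types stable, since rbind() sees all rows together.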

Title function:

    Title <- function(parsedDocument) {
      Title <- parsedDocument %>%
        html_node(".article-title-main") %>%
        html_text() %>%
        gsub("\\r\\n\\s+", "", .) %>%
        trimws(.)
      return(Title)
    }


Full R code:

    library(rvest)
    library(XML)
    library(stringr)


    # Getting the number of Page
    getPageNumber <- function(URL) {
      # print(URL)
      parsedDocument <- read_html(URL)
      Sort1 <- html_nodes(parsedDocument, "div")
      Sort2 <- Sort1[which(html_attr(Sort1, "class") == "pagination al-pagination")]
      P <- str_count(html_text(Sort2), pattern = " \\d+\r\n")
      return(ifelse(length(P) == 0, 0, max(P)))
    }

    # Getting all articles based off of their DOI
    getAllArticles <- function(URL) {
      print(URL)
      parsedDocument <- read_html(URL)
      Sort1 <- html_nodes(parsedDocument, "div")
      Sort2 <- Sort1[which(html_attr(Sort1, "class") == "al-citation-list")]
      ArticleDOInumber <- trimws(gsub(".*10.1093/dnares/", "", html_text(Sort2)))
      URL3 <- "https://doi.org/10.1093/dnares/"
      URL4 <- paste(URL3, ArticleDOInumber, sep = "")
      return(URL4)
    }


    Title <- function(parsedDocument) {
      Title <- parsedDocument %>%
        html_node(".article-title-main") %>%
        html_text() %>%
        gsub("\\r\\n\\s+", "", .) %>%
        trimws(.)
      return(Title)
    }


    # main function with input as parameter year
    findURL <- function(year_chosen) {
      if (year_chosen >= 1994) {
        noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
        pagesURl <- "&fl_SiteID=5275&page="
        URL <- paste(noYearURL, pagesURl, sep = "")
        # URL is working with parameter year_chosen
        Page <- getPageNumber(URL)


        if (Page == 5) {
          Page2 <- 0
          while (Page < Page2 | Page != Page2) {
            Page <- Page2
            URL3 <- paste(URL, Page - 1, sep = "")
            Page2 <- getPageNumber(URL3)
          }
        }
        R_Data <- data.frame()
        for (i in 1:Page) {
          URL2 <- getAllArticles(paste(URL, i, sep = ""))
          for (j in 1:(length(URL2))) {
            parsedDocument <- read_html(URL2[j])
            # print(URL2[j])
            # print(Title(parsedDocument))
            R <- data.frame("Title" = Title(parsedDocument), stringsAsFactors = FALSE)
            # print(R)
            R_Data <- rbind(R_Data, R)
          }
        }
        write.csv(R_Data, "Group4.csv", row.names = FALSE)
      } else {
        print("The Year you provide is out of range, this journal only contain articles from 2005 to present")
      }
    }

    findURL(2003)

