R program is not outputting the correct scraped journal entries


Problem Description

library(rvest)
library(RCurl)
library(XML)
library(stringr)


#Getting the number of Page
getPageNumber <- function(URL) {
  # print(URL)
  parsedDocument <- read_html(URL)
  pageNumber <- parsedDocument %>%
    html_node(".al-currentPage + a:last-child") %>%
    html_text() %>%
    as.integer()
  return(pageNumber)
}


#Getting all articles based off of their DOI
getAllArticles <- function(URL){
  parsedDocument <- read_html(URL)
  findLocationDiv <- html_nodes(parsedDocument, 'div')
  foundClass <- findLocationDiv[which(html_attr(findLocationDiv, "class") == "al-citation-list")]
  ArticleDOInumber <- trimws(gsub(".*10.1093/dnares/", "", html_text(foundClass)))
  DOImain <- "https://doi.org/10.1093/dnares/"
  fullDOI <- paste(DOImain, ArticleDOInumber, sep = "")
  return(fullDOI)
}

#Getting the corresponding authors of an article
CorrespondingAuthors <- function(parsedDocument){
  CorrespondingAuthors <- parsedDocument %>%
    html_node("a.linked-name js-linked-name-trigger") %>%
    html_text()
  return(CorrespondingAuthors)
}

#Getting the email of the corresponding author
CoAuthorEmail <- function(parsedDocument){
  CoAuthorEmail <- parsedDocument %>%
    html_node(".icon-general-mail") %>%
    html_text()
  return(CoAuthorEmail)
}

#Getting the link to the full-text PDF of an article
FullText <- function(parsedDocument){
  FullText <- parsedDocument %>%
    html_node('.PdfOnlyLink .article-pdfLink') %>%
    html_attr('href')
  return(FullText)
}

#main function with input as parameter year
findURL <- function(year_chosen){
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&page="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL is working with parameter year_chosen
    firstPage <- getPageNumber(URL)
    
    if (firstPage == 5) {
      nextPage <- 0
      while (firstPage < nextPage | firstPage != nextPage) {
        firstPage <- nextPage
        URLwithPageNum <- paste(URL, firstPage-1, sep = "")
        nextPage <- getPageNumber(URLwithPageNum)
      }
    }
  DNAresearch <- data.frame()
    for (i in 1:firstPage) {
      URLallArticles <- getAllArticles(paste(URL, i, sep = ""))
      for (j in 1:(length(URLallArticles))) {
        parsedDocument <- read_html(URLallArticles[j])
        #"Title" = Title(parsedDocument),"Authors" = Authors(parsedDocument),"Author Affiliations" = AuthorAffil(parsedDocument),"Corresponding  Authors" CorrespondingAuthors=(parsedDocument),"CoAuthor Email" = CoAuthorEmail(parsedDocument),"Publication Date" = PublicationDate(parsedDocument),"Keywords" = Keywords(parsedDocument),"Abstract" = Abstract(parsedDocument), "Full Text" = FullText(parsedDocument)
        allData <- data.frame("Corresponding Authors" = (parsedDocument),"CoAuthor Email" = CoAuthorEmail(parsedDocument),"Full Text" = FullText(parsedDocument),stringsAsFactors = FALSE)
        #for(i in 1:allData == "NA"){
          #i == "NO"
        #}
        DNAresearch <- rbind(DNAresearch, allData)
      }
    }
    write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
  } else {
    print("The Year you provide is out of range, this journal only contain articles from 2005 to present")
  }
}

##################### Main function test
findURL(1994)


In the program above I am scraping journals from a website. The output is then written to a csv file named DNAresearch.csv. I have three things that need to be fixed.


  1. In CorrespondingAuthors I keep getting the first author of the journal. I actually need all of the authors other than the first author.


  2. In CoAuthorEmail I cannot find the authors' emails, so in the csv file it returns NA. It should output NA, as I believe the email is not referenced; however, I would like the CSV file to return NO instead of NA.


  3. In FullText I am trying to get the full text of the journal. The full text has to be scraped through a pdf link. My csv currently returns NA.


Everything is correct apart from the three issues above. Thank you in advance for the help!

Recommended Answer


This is an incomplete answer; it is just easier than fitting all of this into a comment:


  1. In order to return more than one node instead of just the first node, you need to use html_nodes (with the s). This will return all of the nodes, but has the disadvantage that if the node is missing, the function returns a zero-length vector instead of NA. So if you are sure each article has an author, it should not be a problem; see the short demonstration after the function below.

CorrespondingAuthors <- function(parsedDocument){
  CorrespondingAuthors <- parsedDocument %>%
    html_nodes("a.linked-name js-linked-name-trigger") %>%
    html_text()
  #probably need to add: CorrespondingAuthors <- paste(CorrespondingAuthors, collapse = ", ")
  return(CorrespondingAuthors)
}
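
As a quick illustration of the html_node/html_nodes difference, here is a minimal sketch against a hypothetical in-memory snippet (not the journal page itself):

library(rvest)

#Hypothetical two-author snippet
doc <- minimal_html("<a class='linked-name'>First Author</a>
                     <a class='linked-name'>Second Author</a>")

html_node(doc, "a.linked-name") %>% html_text()    #only "First Author"
html_nodes(doc, "a.linked-name") %>% html_text()   #both names

#With a selector that matches nothing, html_nodes() returns a
#zero-length vector rather than NA
html_nodes(doc, "a.no-such-class") %>% html_text() #character(0)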

  • There is a difference between "NA" and NA. The first is just a string of the characters N and A. To check for the not-available NA, it is best to use the is.na() function, as the sketch below shows.
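
    A minimal sketch of that distinction, including the NA-to-"NO" replacement the question asks for:

    x <- NA     #the real missing value
    y <- "NA"   #just a two-character string

    is.na(x)    #TRUE
    is.na(y)    #FALSE
    x == "NA"   #NA, not FALSE; comparisons against NA propagate NA

    ifelse(is.na(x), "NO", x)   #"NO", the replacement requested in the question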


    There are ways to download PDF files and extract the contents. It is best to ask a new question that is strictly focused on that issue; it is more likely to get answered and be a more useful resource in the future.
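
    For completeness, a minimal sketch of one such approach, using the pdftools package; the URL below is a placeholder, not a real article link:

    library(pdftools)

    pdf_url  <- "https://example.com/article.pdf"   #hypothetical PDF link
    pdf_file <- tempfile(fileext = ".pdf")
    download.file(pdf_url, pdf_file, mode = "wb")   #binary mode matters for PDFs

    pages <- pdf_text(pdf_file)                #one character string per page
    full_text <- paste(pages, collapse = "\n")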


    UPDATE
    Based on the link provided in the comments, here are a working CorrespondingAuthors and CoAuthorEmail:

    url <- "https://academic.oup.com/dnaresearch/article/25/6/655/5123538?searchresult=1"
    page <- read_html(url)
    
        CorrespondingAuthors <- function(parsedDocument){
           CorrespondingAuthors <- parsedDocument %>%
              html_nodes("a.linked-name") %>%
              html_text() 
           #Comma separate string of names
           CorrespondingAuthors  <- paste(CorrespondingAuthors, collapse =", ")
           # Comment the above line for a vector names
           return(CorrespondingAuthors)
        }
        
        
       CoAuthorEmail <- function(parsedDocument){
          CoAuthorEmail <- parsedDocument %>%
               html_node("div.info-author-correspondence a") %>%
               html_text() 
          CoAuthorEmail <- ifelse(is.na(CoAuthorEmail), "No", CoAuthorEmail)
          return(CoAuthorEmail)
       }
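
    Calling these on the page read above should give a comma-separated author string and the correspondence email, or "No" when the email is absent:

    CorrespondingAuthors(page)   #comma-separated string of author names
    CoAuthorEmail(page)          #correspondence email, or "No" if missing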
    

