R program is not outputting the correct scraped journal entries


Problem Description

library(rvest)
library(RCurl)
library(XML)
library(stringr)


#Getting the number of Page
getPageNumber <- function(URL) {
  # print(URL)
  parsedDocument <- read_html(URL)
  pageNumber <- parsedDocument %>%
    html_node(".al-currentPage + a:last-child") %>%
    html_text() %>%
    as.integer()
  return(pageNumber)
}


#Getting all articles based off of their DOI
getAllArticles <- function(URL){
  parsedDocument <- read_html(URL)
  findLocationDiv <- html_nodes(parsedDocument, 'div')
  foundClass <- findLocationDiv[which(html_attr(findLocationDiv, "class") == "al-citation-list")]
  ArticleDOInumber <- trimws(gsub(".*10.1093/dnares/", "", html_text(foundClass)))
  DOImain <- "https://doi.org/10.1093/dnares/"
  fullDOI <- paste(DOImain, ArticleDOInumber, sep = "")
  return(fullDOI)
}

#Getting the corresponding authors of an article
CorrespondingAuthors <- function(parsedDocument){
  CorrespondingAuthors <- parsedDocument %>%
    html_node("a.linked-name js-linked-name-trigger") %>%
    html_text()
  return(CorrespondingAuthors)
}

#Getting the email of the corresponding author
CoAuthorEmail <- function(parsedDocument){
  CoAuthorEmail <- parsedDocument %>%
    html_node(".icon-general-mail") %>%
    html_text()
  return(CoAuthorEmail)
}

#Getting the link to the full-text PDF of an article
FullText <- function(parsedDocument){
  FullText <- parsedDocument %>%
    html_node('.PdfOnlyLink .article-pdfLink') %>%
    html_attr('href')
  return(FullText)
}

#main function with input as parameter year
findURL <- function(year_chosen){
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&page="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL is working with parameter year_chosen
    firstPage <- getPageNumber(URL)
    
    if (firstPage == 5) {
      nextPage <- 0
      while (firstPage < nextPage | firstPage != nextPage) {
        firstPage <- nextPage
        URLwithPageNum <- paste(URL, firstPage-1, sep = "")
        nextPage <- getPageNumber(URLwithPageNum)
      }
    }
  DNAresearch <- data.frame()
    for (i in 1:firstPage) {
      URLallArticles <- getAllArticles(paste(URL, i, sep = ""))
      for (j in 1:(length(URLallArticles))) {
        parsedDocument <- read_html(URLallArticles[j])
        #"Title" = Title(parsedDocument),"Authors" = Authors(parsedDocument),"Author Affiliations" = AuthorAffil(parsedDocument),"Corresponding  Authors" CorrespondingAuthors=(parsedDocument),"CoAuthor Email" = CoAuthorEmail(parsedDocument),"Publication Date" = PublicationDate(parsedDocument),"Keywords" = Keywords(parsedDocument),"Abstract" = Abstract(parsedDocument), "Full Text" = FullText(parsedDocument)
        allData <- data.frame("Corresponding Authors" = (parsedDocument),"CoAuthor Email" = CoAuthorEmail(parsedDocument),"Full Text" = FullText(parsedDocument),stringsAsFactors = FALSE)
        #for(i in 1:allData == "NA"){
          #i == "NO"
        #}
        DNAresearch <- rbind(DNAresearch, allData)
      }
    }
    write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
  } else {
    print("The Year you provide is out of range, this journal only contain articles from 2005 to present")
  }
}

##################### Main function test
findURL(1994)


In the program above I am scraping journals from a website. The output is then written to a csv file named DNAresearch.csv. I have three things that need to be fixed.


  1. In CorrespondingAuthors I keep getting the first author of the journal. I actually need all of the authors other than the first author.


  2. In CoAuthorEmail I cannot find the authors' emails, so in the csv file it returns NA. It should output NA, as I believe the email is not referenced; however, I would like the CSV file to return NO instead of NA.


  3. In FullText I am trying to get the full text of the journal. The full text has to be scraped through a pdf link. My csv currently returns NA.


Everything is correct apart from the three issues above. Thank you in advance for the help!

Recommended Answer


This is an incomplete answer; it is just easier than fitting all of this into a comment:


  1. In order to return more than one node instead of just the first node, you need to use html_nodes (with the s). This will return all of the nodes, but has the disadvantage that if the node is missing, the function returns a zero-length vector instead of NA. So if you are sure each article has an author, it should not be a problem; see the short demonstration after the function below.

CorrespondingAuthors <- function(parsedDocument){
  CorrespondingAuthors <- parsedDocument %>%
    html_nodes("a.linked-name js-linked-name-trigger") %>%
    html_text()
  #probably need to add: CorrespondingAuthors <- paste(CorrespondingAuthors, collapse = ", ")
  return(CorrespondingAuthors)
}
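
As a quick illustration of the html_node/html_nodes difference, here is a minimal sketch against a hypothetical in-memory snippet (not the journal page itself):

library(rvest)

#Hypothetical two-author snippet
doc <- minimal_html("<a class='linked-name'>First Author</a>
                     <a class='linked-name'>Second Author</a>")

html_node(doc, "a.linked-name") %>% html_text()    #only "First Author"
html_nodes(doc, "a.linked-name") %>% html_text()   #both names

#With a selector that matches nothing, html_nodes() returns a
#zero-length vector rather than NA
html_nodes(doc, "a.no-such-class") %>% html_text() #character(0)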

  • There is a difference between "NA" and NA. The first is just a string of the characters N and A. To check for the not-available NA, it is best to use the is.na() function, as the sketch below shows.
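
    A minimal sketch of that distinction, including the NA-to-"NO" replacement the question asks for:

    x <- NA     #the real missing value
    y <- "NA"   #just a two-character string

    is.na(x)    #TRUE
    is.na(y)    #FALSE
    x == "NA"   #NA, not FALSE; comparisons against NA propagate NA

    ifelse(is.na(x), "NO", x)   #"NO", the replacement requested in the question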


    There are ways to download PDF files and extract the contents. It is best to ask a new question that is strictly focused on that issue; it is more likely to get answered and be a more useful resource in the future.
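
    For completeness, a minimal sketch of one such approach, using the pdftools package; the URL below is a placeholder, not a real article link:

    library(pdftools)

    pdf_url  <- "https://example.com/article.pdf"   #hypothetical PDF link
    pdf_file <- tempfile(fileext = ".pdf")
    download.file(pdf_url, pdf_file, mode = "wb")   #binary mode matters for PDFs

    pages <- pdf_text(pdf_file)                #one character string per page
    full_text <- paste(pages, collapse = "\n")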


    UPDATE
    Based on the link provided in the comments, here are a working CorrespondingAuthors and CoAuthorEmail:

    url <- "https://academic.oup.com/dnaresearch/article/25/6/655/5123538?searchresult=1"
    page <- read_html(url)
    
        CorrespondingAuthors <- function(parsedDocument){
           CorrespondingAuthors <- parsedDocument %>%
              html_nodes("a.linked-name") %>%
              html_text() 
           #Comma separate string of names
           CorrespondingAuthors  <- paste(CorrespondingAuthors, collapse =", ")
           # Comment the above line for a vector names
           return(CorrespondingAuthors)
        }
        
        
       CoAuthorEmail <- function(parsedDocument){
          CoAuthorEmail <- parsedDocument %>%
               html_node("div.info-author-correspondence a") %>%
               html_text() 
          CoAuthorEmail <- ifelse(is.na(CoAuthorEmail), "No", CoAuthorEmail)
          return(CoAuthorEmail)
       }
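
    Calling these on the page read above should give a comma-separated author string and the correspondence email, or "No" when the email is absent:

    CorrespondingAuthors(page)   #comma-separated string of author names
    CoAuthorEmail(page)          #correspondence email, or "No" if missing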
    

