R program is not outputting the correct scraped journal entries
Question
library(rvest)
library(RCurl)
library(XML)
library(stringr)

# Getting the number of pages
getPageNumber <- function(URL) {
  # print(URL)
  parsedDocument <- read_html(URL)
  pageNumber <- parsedDocument %>%
    html_node(".al-currentPage + a:last-child") %>%
    html_text() %>%
    as.integer()
  return(pageNumber)
}

# Getting all articles based off of their DOI
getAllArticles <- function(URL) {
  parsedDocument <- read_html(URL)
  findLocationDiv <- html_nodes(parsedDocument, 'div')
  foundClass <- findLocationDiv[which(html_attr(findLocationDiv, "class") == "al-citation-list")]
  ArticleDOInumber <- trimws(gsub(".*10.1093/dnares/", "", html_text(foundClass)))
  DOImain <- "https://doi.org/10.1093/dnares/"
  fullDOI <- paste(DOImain, ArticleDOInumber, sep = "")
  return(fullDOI)
}

CorrespondingAuthors <- function(parsedDocument) {
  CorrespondingAuthors <- parsedDocument %>%
    html_node("a.linked-name js-linked-name-trigger") %>%
    html_text()
  return(CorrespondingAuthors)
}

CoAuthorEmail <- function(parsedDocument) {
  CoAuthorEmail <- parsedDocument %>%
    html_node(".icon-general-mail") %>%
    html_text()
  return(CoAuthorEmail)
}

FullText <- function(parsedDocument) {
  FullText <- parsedDocument %>%
    html_node('.PdfOnlyLink .article-pdfLink') %>%
    html_attr('href')
  return(FullText)
}

# Main function with input as parameter year
findURL <- function(year_chosen) {
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&page="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL is working with parameter year_chosen
    firstPage <- getPageNumber(URL)
    if (firstPage == 5) {
      nextPage <- 0
      while (firstPage < nextPage | firstPage != nextPage) {
        firstPage <- nextPage
        URLwithPageNum <- paste(URL, firstPage - 1, sep = "")
        nextPage <- getPageNumber(URLwithPageNum)
      }
    }
    DNAresearch <- data.frame()
    for (i in 1:firstPage) {
      URLallArticles <- getAllArticles(paste(URL, i, sep = ""))
      for (j in 1:length(URLallArticles)) {
        parsedDocument <- read_html(URLallArticles[j])
        # "Title" = Title(parsedDocument), "Authors" = Authors(parsedDocument), "Author Affiliations" = AuthorAffil(parsedDocument), "Publication Date" = PublicationDate(parsedDocument), "Keywords" = Keywords(parsedDocument), "Abstract" = Abstract(parsedDocument)
        allData <- data.frame("Corresponding Authors" = CorrespondingAuthors(parsedDocument),
                              "CoAuthor Email" = CoAuthorEmail(parsedDocument),
                              "Full Text" = FullText(parsedDocument),
                              stringsAsFactors = FALSE)
        # for (i in 1:allData == "NA") {
        #   i == "NO"
        # }
        DNAresearch <- rbind(DNAresearch, allData)
      }
    }
    write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
  } else {
    print("The year you provided is out of range; this journal only contains articles from 2005 to present")
  }
}

##################### Main function test
findURL(1994)
In the program above I am scraping journal articles from a website. The output is then written to a csv file named DNAresearch.csv. I have three things that need to be fixed.
In CorrespondingAuthors I keep getting only the first author of each article. I actually need all of the authors other than the first author.
In CoAuthorEmail I cannot find the authors' emails, so the csv file returns NA. It should output NA, as I believe the email is not referenced on the page, but I would like the CSV file to show NO instead of NA.
In FullText I am trying to get the full text of the article. The full text has to be scraped through a pdf link. My csv currently returns NA.
Everything is correct except for the three issues above. Thank you in advance for the help!
Answer
This is an incomplete answer; it is just easier than fitting all of this into a comment:
In order to return more than one node instead of just the first one, you need to use html_nodes (with the "s"). This returns all matching nodes, but has the disadvantage that if the node is missing the function returns a zero-length vector instead of NA. So as long as you are sure every article has an author, this should not be a problem:
CorrespondingAuthors <- function(parsedDocument) {
  CorrespondingAuthors <- parsedDocument %>%
    html_nodes("a.linked-name js-linked-name-trigger") %>%
    html_text()
  # probably need to add: CorrespondingAuthors <- paste(CorrespondingAuthors, collapse = ", ")
  return(CorrespondingAuthors)
}
There is a difference between the string "NA" and the value NA. The first is just a character string made of the letters N and A. To check for the not-available NA, it is best to use the is.na() function.
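A quick illustration of the difference, using a small made-up vector (plain R, no scraping needed):

```r
x <- c("NA", NA, "someone@example.com")  # a string "NA", a real missing value, an email

x == "NA"   # comparing against the real NA yields NA, not FALSE
is.na(x)    # FALSE TRUE FALSE -- only flags the true missing value

# ifelse with is.na() swaps the true NA for the literal "NO"
ifelse(is.na(x), "NO", x)  # "NA" "NO" "someone@example.com"
```

This is the pattern that lets a csv column show "NO" where the scrape found nothing, while leaving genuine "NA" strings alone.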
There are ways to download PDF files and extract their contents, but it is best to ask a new question strictly focused on that issue. It is more likely to get answered and will be a more useful resource in the future.
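As a rough sketch only: the pdftools package (an assumption here, it is not used anywhere in the original code) can extract text from a downloaded PDF. Whether academic.oup.com serves the PDF to a non-browser client is a separate problem, which is part of why a dedicated question is the better route:

```r
# Sketch, assuming pdftools is installed; not tested against the journal site
extract_pdf_text <- function(pdf_url, dest = tempfile(fileext = ".pdf")) {
  download.file(pdf_url, dest, mode = "wb")         # binary mode so the PDF is not corrupted
  paste(pdftools::pdf_text(dest), collapse = "\n")  # one string per page, collapsed into one
}
```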
UPDATE
Based on the link provided in the comments, here are a working CorrespondingAuthors and CoAuthorEmail:
url <- "https://academic.oup.com/dnaresearch/article/25/6/655/5123538?searchresult=1"
page <- read_html(url)

CorrespondingAuthors <- function(parsedDocument) {
  CorrespondingAuthors <- parsedDocument %>%
    html_nodes("a.linked-name") %>%
    html_text()
  # Comma-separated string of names
  CorrespondingAuthors <- paste(CorrespondingAuthors, collapse = ", ")
  # Comment out the above line for a vector of names
  return(CorrespondingAuthors)
}

CoAuthorEmail <- function(parsedDocument) {
  CoAuthorEmail <- parsedDocument %>%
    html_node("div.info-author-correspondence a") %>%
    html_text()
  CoAuthorEmail <- ifelse(is.na(CoAuthorEmail), "No", CoAuthorEmail)
  return(CoAuthorEmail)
}
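The two selectors can be sanity-checked offline with rvest's minimal_html; the HTML fragment below is invented to mimic the classes the real article page is assumed to use:

```r
library(rvest)

# Invented fragment mimicking the assumed structure of the article page
doc <- minimal_html('
  <a class="linked-name js-linked-name-trigger">A. First</a>
  <a class="linked-name js-linked-name-trigger">B. Second</a>
  <div class="info-author-correspondence"><a href="mailto:b@example.com">b@example.com</a></div>')

html_text(html_nodes(doc, "a.linked-name"))                    # both author names
html_text(html_node(doc, "div.info-author-correspondence a"))  # the email
```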