R function is looping over the same data in webscraper
Question
Here is the program I have written:
library(rvest)
library(RCurl)
library(XML)
library(stringr)

# Getting the number of pages
getPageNumber <- function(URL) {
  parsedDocument <- read_html(URL)
  Sort1 <- html_nodes(parsedDocument, "div")
  Sort2 <- Sort1[which(html_attr(Sort1, "class") == "pageNumbers al-pageNumbers")]
  P <- str_count(html_text(Sort2), pattern = " \\d+\r\n")
  return(ifelse(length(P) == 0, 0, max(P)))
}

# Getting all articles based off of their DOI
getAllArticles <- function(URL) {
  parsedDocument <- read_html(URL)
  Sort1 <- html_nodes(parsedDocument, "div")
  Sort2 <- Sort1[which(html_attr(Sort1, "class") == "al-citation-list")]
  ArticleDOInumber <- trimws(gsub(".*10.1093/dnares/", "", html_text(Sort2)))
  URL3 <- "https://doi.org/10.1093/dnares/"
  URL4 <- paste(URL3, ArticleDOInumber, sep = "")
  return(URL4)
}

Title <- function(parsedDocument) {
  Sort1 <- html_nodes(parsedDocument, "h1")
  Title <- gsub("<h1>\\n|\\n</h1>", "", Sort1)
  return(Title)
}

# main function with input as parameter year
findURL <- function(year_chosen) {
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&startpage="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL is working with parameter year_chosen
    Page <- getPageNumber(URL)
    Page2 <- 0
    while (Page < Page2 | Page != Page2) {
      Page <- Page2
      URL3 <- paste(URL, Page - 1, sep = "")
      Page2 <- getPageNumber(URL3)
    }
    R_Data <- data.frame()
    for (i in 1:Page) { # 0:Page-1
      URL2 <- getAllArticles(paste(URL, i, sep = ""))
      for (j in 1:(length(URL2))) {
        parsedDocument <- read_html(URL2[j])
        print(URL2[j])
        R <- data.frame("Title" = Title(parsedDocument), stringsAsFactors = FALSE)
        R_Data <- rbind(R_Data, R)
      }
    }
    paste(URL2)
    suppressWarnings(write.csv(R_Data, "DNAresearch.csv", row.names = FALSE, sep = "\t"))
    # return(R_Data)
  } else {
    print("The Year you provide is out of range, this journal only contain articles from 2005 to present")
  }
}

findURL(2003)
The output of my code is as follows:
[1] "https://doi.org/10.1093/dnares/10.6.249"
[1] "https://doi.org/10.1093/dnares/10.6.263"
[1] "https://doi.org/10.1093/dnares/10.6.277"
[1] "https://doi.org/10.1093/dnares/10.6.229"
[1] "https://doi.org/10.1093/dnares/10.6.239"
[1] "https://doi.org/10.1093/dnares/10.6.287"
[1] "https://doi.org/10.1093/dnares/10.5.221"
[1] "https://doi.org/10.1093/dnares/10.5.203"
[1] "https://doi.org/10.1093/dnares/10.5.213"
[1] "https://doi.org/10.1093/dnares/10.4.137"
[1] "https://doi.org/10.1093/dnares/10.4.147"
[1] "https://doi.org/10.1093/dnares/10.4.167"
[1] "https://doi.org/10.1093/dnares/10.4.181"
[1] "https://doi.org/10.1093/dnares/10.4.155"
[1] "https://doi.org/10.1093/dnares/10.3.115"
[1] "https://doi.org/10.1093/dnares/10.3.85"
[1] "https://doi.org/10.1093/dnares/10.3.123"
[1] "https://doi.org/10.1093/dnares/10.3.129"
[1] "https://doi.org/10.1093/dnares/10.3.97"
[1] "https://doi.org/10.1093/dnares/10.2.59"
[1] "https://doi.org/10.1093/dnares/10.6.249"
[1] "https://doi.org/10.1093/dnares/10.6.263"
I'm trying to scrape a journal with the year as a parameter. I've scraped one page, but when the loop is supposed to move to the next page it just goes back to the top of the page and loops over the same data. My code should be right, and I don't understand why this is happening. Thank you in advance.
Answer
It is not that it is reading the same URL. It is that you are selecting the wrong node, which happens to yield repeating info. As I mentioned in your last question, you need to re-work your Title function. The Title re-write below will extract the actual article title based on a class name and a single-node match.
Please note the removal of your sep arg: write.csv fixes the separator to a comma, so passing sep only produces a warning. There are also some other areas of the code that look like they could probably be simplified in terms of logic.
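As a side note, if tab-separated output was actually the intent, write.table is the base function that honours a sep argument. A minimal sketch (the file name here is just an example):

```r
df <- data.frame(Title = c("A", "B"), stringsAsFactors = FALSE)

# write.csv hard-codes sep = "," and warns if you try to override it,
# which is why the original code needed suppressWarnings().
# For a genuine tab-delimited file, use write.table instead:
write.table(df, "DNAresearch.tsv", sep = "\t", row.names = FALSE)
```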
Title function:
Title <- function(parsedDocument) {
  Title <- parsedDocument %>%
    html_node(".article-title-main") %>%
    html_text() %>%
    gsub("\\r\\n\\s+", "", .) %>%
    trimws(.)
  return(Title)
}
R:
library(rvest)
library(XML)
library(stringr)

# Getting the number of pages
getPageNumber <- function(URL) {
  # print(URL)
  parsedDocument <- read_html(URL)
  Sort1 <- html_nodes(parsedDocument, "div")
  Sort2 <- Sort1[which(html_attr(Sort1, "class") == "pagination al-pagination")]
  P <- str_count(html_text(Sort2), pattern = " \\d+\r\n")
  return(ifelse(length(P) == 0, 0, max(P)))
}

# Getting all articles based off of their DOI
getAllArticles <- function(URL) {
  print(URL)
  parsedDocument <- read_html(URL)
  Sort1 <- html_nodes(parsedDocument, "div")
  Sort2 <- Sort1[which(html_attr(Sort1, "class") == "al-citation-list")]
  ArticleDOInumber <- trimws(gsub(".*10.1093/dnares/", "", html_text(Sort2)))
  URL3 <- "https://doi.org/10.1093/dnares/"
  URL4 <- paste(URL3, ArticleDOInumber, sep = "")
  return(URL4)
}

Title <- function(parsedDocument) {
  Title <- parsedDocument %>%
    html_node(".article-title-main") %>%
    html_text() %>%
    gsub("\\r\\n\\s+", "", .) %>%
    trimws(.)
  return(Title)
}

# main function with input as parameter year
findURL <- function(year_chosen) {
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&page="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL is working with parameter year_chosen
    Page <- getPageNumber(URL)
    if (Page == 5) {
      Page2 <- 0
      while (Page < Page2 | Page != Page2) {
        Page <- Page2
        URL3 <- paste(URL, Page - 1, sep = "")
        Page2 <- getPageNumber(URL3)
      }
    }
    R_Data <- data.frame()
    for (i in 1:Page) {
      URL2 <- getAllArticles(paste(URL, i, sep = ""))
      for (j in 1:(length(URL2))) {
        parsedDocument <- read_html(URL2[j])
        # print(URL2[j])
        # print(Title(parsedDocument))
        R <- data.frame("Title" = Title(parsedDocument), stringsAsFactors = FALSE)
        # print(R)
        R_Data <- rbind(R_Data, R)
      }
    }
    write.csv(R_Data, "Group4.csv", row.names = FALSE)
  } else {
    print("The Year you provide is out of range, this journal only contain articles from 2005 to present")
  }
}

findURL(2003)
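As one example of the simplification mentioned earlier, the nested for loops that grow R_Data with rbind could be replaced with a functional style. This is only a sketch, assuming the same getAllArticles and Title helpers and the same URL and Page variables as above; it has not been run against the live site:

```r
# Collect every DOI URL across all pages, then map Title() over them.
allDOIs <- unlist(lapply(seq_len(Page), function(i) {
  getAllArticles(paste0(URL, i))
}))

R_Data <- data.frame(
  Title = vapply(allDOIs, function(u) Title(read_html(u)), character(1)),
  stringsAsFactors = FALSE,
  row.names = NULL
)
```

Growing a data frame inside a loop with rbind is quadratic in the number of rows; building the vector once with vapply avoids that and makes the intent clearer.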