R webscraper is not outputting pdf text in one row
Problem description
I've been scraping articles from the Oxford journals in R and want to grab the full text of specific articles. Every article has a PDF link, so I've been trying to pull that link and write the entire text to a CSV. The full text should fit into one row, but the output CSV shows a single article spread across 11 rows. How can I fix this?
The code is as follows:
####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)
fullText <- function(parsedDocument){
  endLink <- parsedDocument %>%
    html_node('.article-pdfLink') %>% html_attr('href')
  frontLink <- "https://academic.oup.com"
  #link of pdf
  pdfLink <- paste(frontLink,endLink,sep = "")
  #extract full text from pdfLink
  pdfFullText <- pdf_text(pdfLink)
  fulltext <- paste(pdfFullText, sep = "\n")
  return(fulltext)
}
#############################################
#main function with input as parameter year
testFullText <- function(DOIurl){
  parsedDocument <- read_html(DOIurl)
  DNAresearch <- data.frame()
  allData <- data.frame("Full Text" = fullText(parsedDocument), stringsAsFactors = FALSE)
  DNAresearch <- rbind(DNAresearch, allData)
  write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")
Recommended answer
Looking at your last function, if I understand correctly, you want to take the URL, scrape all the text into a data frame/tibble, and then export it to a CSV. Here is how to do it for a single article; with a little manipulation you should be able to loop over several links (apologies if I am misunderstanding):
library(tidyverse)
library(rvest)
# read in html link
document_link <- read_html("https://doi.org/10.1093/dnares/dsm026")
# get the text, and put it into a tibble with only 1 row
text_tibble <- document_link %>%
html_nodes('.chapter-para') %>%
html_text() %>%
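  enframe(name = NULL) %>%  # one-column tibble named `value`; as_tibble() on a plain vector is discouraged in recent tibble versions
  summarize(full_text = paste(value, collapse = " ")) ## this will collapse to 1 row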
# now write to csv
## write_csv(text_tibble, file = "")
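The answer above reads the article text from the HTML page rather than from the PDF. If the PDF route from the question is preferred, the 11 rows most likely come from `pdf_text()` returning one string per page, combined with `paste(..., sep = "\n")`: `sep` only separates multiple arguments, while `collapse` is what joins the elements of a single vector. A minimal sketch of the difference, using a made-up `pages` vector as a stand-in for the `pdf_text()` output so that it runs without a network connection:

```r
# Hypothetical stand-in for the result of pdf_text(): one string per page.
pages <- c("Text of page 1", "Text of page 2", "Text of page 3")

# sep only separates multiple arguments, so the vector keeps its length.
with_sep <- paste(pages, sep = "\n")
length(with_sep)       # 3

# collapse joins the elements of the vector into a single string.
with_collapse <- paste(pages, collapse = "\n")
length(with_collapse)  # 1

# A length-one character vector produces a one-row data frame.
one_row <- data.frame("Full Text" = with_collapse, stringsAsFactors = FALSE)
nrow(one_row)          # 1
```

Under that assumption, changing the line in `fullText()` to `fulltext <- paste(pdfFullText, collapse = "\n")` should be enough to get one row per article in the CSV.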