R webscraper is not outputting pdf text in one row


Problem description

I've been web scraping articles from the Oxford journals in R and want to grab the full text of specific articles. Every article has a PDF link, so I've been trying to pull that link and scrape the entire text into a csv. The full text should fit into 1 row, but the csv output shows a single article spread over 11 rows. How can I fix this?

The code is as follows:

####install.packages("rvest")
library(rvest)
library(RCurl)
library(XML)
library(stringr)
#for Fulltext to read pdf
####install.packages("pdftools")
library(pdftools)


fullText <- function(parsedDocument){
  endLink <- parsedDocument %>%
    html_node('.article-pdfLink') %>% html_attr('href')
  frontLink <- "https://academic.oup.com"
  #link  of pdf
  pdfLink <- paste(frontLink,endLink,sep = "")
  #extract full text  from pdfLink
  pdfFullText <- pdf_text(pdfLink)
  fulltext <- paste(pdfFullText, sep = "\n")
  return(fulltext)
}
#############################################

#main function with the article's DOI url as input parameter
testFullText <- function(DOIurl){
  parsedDocument <- read_html(DOIurl)
  DNAresearch <- data.frame()
  allData <- data.frame("Full Text" = fullText(parsedDocument), stringsAsFactors = FALSE)
  DNAresearch <-  rbind(DNAresearch, allData)
  write.csv(DNAresearch, "DNAresearch.csv", row.names = FALSE)
}
testFullText("https://doi.org/10.1093/dnares/dsm026")

Recommended answer

Looking at your last function, if I understand correctly, you want to take the url, scrape all the text into a data frame/tibble, and then export it to a csv. Here is how you can do it with just 1 article, and you should be able to loop through several links with a little manipulation (apologies if I am misunderstanding):

library(tidyverse)
library(rvest)

# read in html link
document_link <- read_html("https://doi.org/10.1093/dnares/dsm026")

# get the text, and put it into a tibble with only 1 row
text_tibble <- document_link %>% 
  html_nodes('.chapter-para') %>% 
  html_text() %>% 
  as_tibble() %>% 
  summarize(full_text = paste(value, collapse = " ")) ## this will collapse to 1 row

# now write to csv
## write_csv(text_tibble, file = "")
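
The code above takes the text from the HTML page rather than from the PDF. If the PDF text is still wanted, a likely cause of the 11 rows is that `pdf_text()` returns a character vector with one element per page, and `paste(x, sep = "\n")` does not merge the elements of a single vector; `collapse` does. The sketch below illustrates this with a made-up `pages` vector standing in for the `pdf_text()` output (the page strings are placeholders, not text from the article):

```r
# Stand-in for the result of pdftools::pdf_text(): one string per PDF page.
# These strings are hypothetical placeholders.
pages <- c("Text of page one.", "Text of page two.", "Text of page three.")

# sep only separates the *arguments* of paste(); with a single vector
# argument the result still has one element per page.
not_collapsed <- paste(pages, sep = "\n")
length(not_collapsed)

# collapse joins all elements of the vector into one string.
collapsed <- paste(pages, collapse = "\n")
length(collapsed)

# A length-one string gives a one-row data frame, hence one row in the csv.
df <- data.frame(full_text = collapsed, stringsAsFactors = FALSE)
nrow(df)
```

Under that assumption, changing `sep = "\n"` to `collapse = "\n"` inside `fullText()` in the question should be enough to get one row per article.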
