Use R to convert PDF files to text files for text mining

Question

I have nearly one thousand PDF journal articles in a folder, and I need to text mine the abstracts of all of them. At the moment I am doing the following:

dest <- "~/A1.pdf"

# set path to pdftotext.exe and convert pdf to text
exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = FALSE)

# get txt-file name (replace only a trailing ".pdf") and open it
filetxt <- sub("\\.pdf$", ".txt", dest)
shell.exec(filetxt)

This converts one PDF file to one .txt file. I then copy the abstract into another .txt file and compile the results by hand, which is tedious.

How can I read every article in the folder and convert each one into a .txt file that contains only that article's abstract? It could be done by keeping only the content between ABSTRACT and INTRODUCTION in each article, but I have not managed to do this. Any help is appreciated.

Recommended answer

Yes, this is not really an R question, as IShouldBuyABoat notes, but it is something that R can do with only minor contortions...

Use R to convert PDF files to txt files...

# folder with 1000s of PDFs
dest <- "C:\\Users\\Desktop"

# make a vector of PDF file names (match the .pdf extension only)
myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)

# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
# wait = TRUE so that every txt file exists before the next step reads it
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"',
             paste0('"', i, '"')), wait = TRUE))
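With a thousand files it can help to check which conversions actually produced output. When no output file is named, pdftotext writes a .txt file with the same base name next to the PDF, so the expected paths can be computed from the PDF names. A minimal sketch, assuming that default naming; the helper names `txt_path_for` and `missing_conversions` are made up for illustration and are not part of the original answer:

```r
# Hypothetical helper: the .txt path that pdftotext writes for a given PDF
# when no output file is named (same folder, same base name).
txt_path_for <- function(pdf_files) {
  sub("\\.pdf$", ".txt", pdf_files, ignore.case = TRUE)
}

# Hypothetical helper: PDFs whose text file does not exist yet, for example
# because the conversion failed or has not been run.
missing_conversions <- function(pdf_files) {
  pdf_files[!file.exists(txt_path_for(pdf_files))]
}

# Example with made-up file names; only the final extension is replaced
txt_path_for(c("C:/papers/A1.pdf", "C:/papers/my.pdf.notes.PDF"))
```

Calling `missing_conversions(myfiles)` after the conversion step would then list any PDFs that need a second look.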

Extract only abstracts from txt files...

# if you just want the abstracts, we can use regex to extract that part of
# each txt file. Assumes that the abstract is always between the words 'Abstract'
# and 'Introduction' (case-sensitive, as written)
mytxtfiles <- list.files(path = dest, pattern = "\\.txt$", full.names = TRUE)
abstracts <- lapply(mytxtfiles, function(i) {
  # read the file as whitespace-separated words; quote = "" stops apostrophes
  # and quotation marks in the text from being treated as string delimiters
  j <- paste0(scan(i, what = character(), quote = "", quiet = TRUE), collapse = " ")
  regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl = TRUE))
})
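The question mentions the headings in upper case (ABSTRACT, INTRODUCTION), which a case-sensitive pattern for 'Abstract' would miss. Here is a small sketch of the same lookbehind/lookahead idea made case-insensitive with the `(?i)` flag and limited to the first match. The helper name `extract_abstract` and the sample text are invented for illustration:

```r
# Hypothetical helper: return the text between the first "Abstract" and the
# next "Introduction", ignoring case, or NA if the pattern does not match.
extract_abstract <- function(txt) {
  m <- regexpr("(?i)(?<=abstract).*?(?=introduction)", txt, perl = TRUE)
  if (m[1] == -1) return(NA_character_)
  trimws(regmatches(txt, m))
}

# Made-up sample text standing in for one converted article
sample_txt <- "Journal of Examples ABSTRACT We study widgets. INTRODUCTION Widgets are common."
extract_abstract(sample_txt)          # "We study widgets."
extract_abstract("No headings here")  # NA
```

Note that `.` does not match newlines here, so this relies on the text having been collapsed into a single line first, as the `paste0(..., collapse = " ")` call above does.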

Write abstracts into separate txt files...

# write abstracts as txt files
# (or use them in the list for whatever you want to do next)
# seq_along() is safe when the list is empty; unlist() flattens each match list
lapply(seq_along(abstracts), function(i) {
  write.table(unlist(abstracts[[i]]),
              file = paste(mytxtfiles[i], "abstract", "txt", sep = "."),
              quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " ")
})
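Some articles will not match the pattern at all, and the write step above still creates an (empty) file for them. A possible variation, sketched here under the assumption that one abstract per article is wanted, skips the misses and names the output `<name>.abstract.txt` rather than `<name>.txt.abstract.txt`. The helper `write_abstracts` is hypothetical:

```r
# Hypothetical helper: write one abstract per file as "<name>.abstract.txt",
# skipping articles where no abstract was found. Returns the paths written.
write_abstracts <- function(txt_files, abstracts) {
  out_files <- paste0(tools::file_path_sans_ext(txt_files), ".abstract.txt")
  written <- character(0)
  for (k in seq_along(abstracts)) {
    a <- unlist(abstracts[[k]])
    a <- a[!is.na(a) & nzchar(a)]
    if (length(a) == 0) next          # nothing matched for this article
    writeLines(a[1], out_files[k])    # keep the first match only
    written <- c(written, out_files[k])
  }
  written
}
```

It would be called as `write_abstracts(mytxtfiles, abstracts)`. Because the output files also end in .txt, writing them to a separate folder is worth considering if the folder listing is ever re-run.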

And now you're ready to do some text mining on the abstracts.
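As one very basic first step, word frequencies can be counted with base R alone. This is only a sketch: it assumes English text, treats anything that is not a letter a-z as a separator, and uses a made-up helper name `word_counts` and made-up sample abstracts.

```r
# Hypothetical helper: lower-case the text, split on anything that is not a
# letter a-z, and count how often each word occurs (most frequent first).
word_counts <- function(texts) {
  words <- unlist(strsplit(tolower(texts), "[^a-z]+"))
  words <- words[nzchar(words)]   # drop empty strings left by leading separators
  sort(table(words), decreasing = TRUE)
}

# Made-up abstracts standing in for the extracted ones
sample_abstracts <- c("Widgets improve yield.",
                      "Yield depends on widgets and weather.")
word_counts(sample_abstracts)
```

For real work a dedicated text-mining package would handle stop words, stemming and non-ASCII characters, which this sketch ignores.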
