Text mining pdf files/issues with word frequencies


Problem Description

I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the high-frequency words I get are phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but with others I get these random Greek letters. Is this a problem with character encoding? (By the way, all the documents are in English.) Any suggestions?

# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                   language = "en",
                                                   id = "id1")
  content(pdf)[1:4]
}


docs<- Corpus(URISource(uri, mode = ""),
    readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)  
docs <- tm_map(docs, tolower) 
docs <- tm_map(docs, removeWords, stopwords("english")) 

library(SnowballC)   
docs <- tm_map(docs, stemDocument)  
docs <- tm_map(docs, stripWhitespace) 
docs <- tm_map(docs, PlainTextDocument)  

dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs) 
freq <- colSums(as.matrix(dtm))   
length(freq)  
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)] 
freq[tail(ord)]

Answer

I think that ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, this code works without generating the weird words that you mentioned:

library(tm)
uri <- c("2012.pdf")
# readPDF() here relies on the xpdf command line tools (pdfinfo/pdftotext)
# rather than the ghostscript engine used in the question
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)  # newer tm versions need content_transformer(tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)     # reduce words to their stems
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))        # term frequencies across the corpus
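
For a quick sanity check, you can inspect the top entries of the freq vector computed above; after this fix they should be ordinary English stems rather than stray Greek letters:

# Show the ten most frequent terms in the document
head(sort(freq, decreasing = TRUE), 10)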

We can visualize the most frequently used words in your pdf file with a word cloud:

library(wordcloud)
# Plot the 80 most frequent terms; wordcloud() computes frequencies
# from the corpus itself when no freq vector is supplied
wordcloud(docs, max.words = 80, random.order = FALSE, scale = c(3, 0.5),
          colors = brewer.pal(8, "Dark2"))

Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
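
If the leftover stem fragments are a concern, tm also offers stemCompletion(), which maps stems back to full words drawn from a dictionary corpus. A minimal sketch, assuming docs_unstemmed is a copy of the corpus saved just before the stemDocument() step (a hypothetical name, not defined in the code above):

# Save an unstemmed copy earlier in the pipeline, e.g.
#   docs_unstemmed <- docs   # placed right before the stemDocument() call
# then complete stems to their most frequent full form:
stemCompletion(c("issu", "method"), dictionary = docs_unstemmed,
               type = "prevalent")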
