使用词频进行文本挖掘pdf文件/问题 [英] Text mining pdf files/issues with word frequencies
问题描述
我正在尝试挖掘具有丰富pdf编码和图形的文章的pdf.我注意到,当我挖掘一些pdf文档时,我得到的高频单词是phi,taeoe,toe,sigma,gamma等.它与某些pdf文档一起使用时效果很好,但我与其他人却收到了这些随机的希腊字母.这是字符编码的问题吗? (顺便说一下,所有文件都是英文的).有什么建议?
I am trying to mine a pdf of an article with rich pdf encodings and graphs. I noticed that when i mine some pdf documents i get the high frequency words to be p taeoe,toe,sigma, gamma etc. It works well with some pdf documents but i get these random greek letters with others. Is this the problem with character encoding? (Btw all the documents are in english). Any suggestions?
# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if(all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
language = "en",
id = "id1")
content(pdf)[1:4]
}
docs<- Corpus(URISource(uri, mode = ""),
readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)]
freq[tail(ord)]
推荐答案
我认为ghostscript
在这里造成了所有麻烦.假设正确安装了pdfinfo
和pdftotext
,此代码将正常工作,而不会生成您提到的怪异单词:
I think that ghostscript
is creating all the trouble here. Assuming that pdfinfo
and pdftotext
are properly installed, this code works without generating the weird words that you mentioned:
library(tm)
uri <- c("2012.pdf")
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
language = "en",
id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
library(SnowballC)
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
我们可以可视化您的pdf文件中最常用的单词的结果带有词云:
We can visualize the result of the most frequently used words in your pdf file with a word cloud:
library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))
显然,这个结果并不完美;主要是因为词干几乎无法获得100%可靠的结果(例如,我们仍然将问题"和问题"作为单独的词;或方法"和方法").即使SnowballC
做得相当不错,我也不知道R中有任何可靠的词干算法.
Obviously this result is not perfect; mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we have still "issues" and "issue" as separate words; or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC
does a reasonably good job.
这篇关于使用词频进行文本挖掘pdf文件/问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!