Big Text Corpus breaks tm_map


Question

I have been breaking my head over this one for the last few days. I searched all the SO archives and tried the suggested solutions, but I just can't get this to work. I have sets of txt documents in folders such as 2000 06, 1995 -99 etc., and I want to run some basic text mining operations, such as creating a document-term matrix and a term-document matrix and doing some operations based on co-locations of words. My script works on a smaller corpus, but it fails when I try it with the bigger corpus. I have pasted in the code for one such folder operation.

library(tm) # Framework for text mining.
library(SnowballC) # Provides wordStem() for stemming.
library(RColorBrewer) # Generate palette of colours for plots.
library(ggplot2) # Plot word frequencies.
library(magrittr)
library(Rgraphviz)
library(directlabels)

setwd("/ConvertedText")
txt <- file.path("2000 -06")

# readerControl "language" expects a language code such as "en", not an encoding name
docs <- VCorpus(DirSource(txt, encoding = "UTF-8"), readerControl = list(language = "en"))
docs <- tm_map(docs, content_transformer(tolower), mc.cores=1)
docs <- tm_map(docs, removeNumbers, mc.cores=1)
docs <- tm_map(docs, removePunctuation, mc.cores=1)
docs <- tm_map(docs, stripWhitespace, mc.cores=1)
docs <- tm_map(docs, removeWords, stopwords("SMART"), mc.cores=1)
docs <- tm_map(docs, removeWords, stopwords("en"), mc.cores=1)
#corpus creation complete

setwd("/ConvertedText/output")
dtm<-DocumentTermMatrix(docs)
tdm<-TermDocumentMatrix(docs)
m<-as.matrix(dtm)
write.csv(m, file="dtm.csv")
dtms<-removeSparseTerms(dtm, 0.2)
m1<-as.matrix(dtms)
write.csv(m1, file="dtms.csv")
# matrix creation/storage complete

freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wf <- data.frame(word=names(freq), freq=freq)
freq[1:50]
#adjust freq score in next line
p <- ggplot(subset(wf, freq>100), aes(word, freq))+ geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45, hjust=1))
ggsave("frequency2000-06.png", plot = p, height=12, width=17, dpi=72)
# frequency graph generated


x<-as.matrix(findFreqTerms(dtm, lowfreq=1000))
write.csv(x, file="freqterms00-06.csv")
png("correlation2000-06.png", width=12, height=12, units="in", res=900)
graph.par(list(edges=list(col="lightblue", lty="solid", lwd=0.3)))
graph.par(list(nodes=list(col="darkgreen", lty="dotted", lwd=2, fontsize=50)))
plot(dtm, terms=findFreqTerms(dtm, lowfreq=1000)[1:50],corThreshold=0.7)
dev.off()

When I use the mc.cores=1 argument in tm_map, the operation runs indefinitely. However, if I use the lazy=TRUE argument in tm_map, it seemingly goes well, but subsequent operations give this error:

Error in UseMethod("meta", x) : 
  no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning messages:
1: In mclapply(x$content[i], function(d) tm_reduce(d, x$lazy$maps)) :
  all scheduled cores encountered errors in user code
2: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
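
The "try-error" class in this message suggests that a transformation failed on at least one document, and that lazy=TRUE appears to have only postponed the failure until the documents were actually accessed. One plausible cause, consistent with the iconv fix in the answer, is text that is not valid UTF-8. A minimal sketch for spotting such files before building the corpus, assuming R >= 3.3.0 for validUTF8(); the helper name find_invalid_utf8 and the demo files are made up for illustration:

```r
# Hypothetical helper: list the files in a directory whose contents
# are not valid UTF-8 (validUTF8() needs R >= 3.3.0).
find_invalid_utf8 <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  is_bad <- vapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    !all(validUTF8(lines))
  }, logical(1))
  files[is_bad]
}

# Demo on a throwaway directory with one clean and one broken file.
demo_dir <- file.path(tempdir(), "utf8-check")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("plain ascii text", file.path(demo_dir, "good.txt"))
# 0xE9 is Latin-1 "e acute"; on its own it is not a valid UTF-8 sequence.
writeBin(c(charToRaw("caf"), as.raw(0xE9), charToRaw(" au lait\n")),
         file.path(demo_dir, "bad.txt"))

bad_files <- find_invalid_utf8(demo_dir)
print(basename(bad_files))
```

Pointing the helper at the real corpus folder instead of demo_dir would show whether any input files need re-encoding before tm_map runs.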

I have been looking all over for a solution but have failed consistently. Any help would be greatly appreciated!

Best, k

Recommended answer

I found a solution that works.

Background / debugging steps

I tried several things that did not work:

  • Adding "content_transformer" to some of the tm_map calls, or to just one (tolower)
  • Adding "lazy = T" to tm_map
  • Trying some parallel computing packages

While it doesn't work for two of my scripts, it works every time for a third. The code of all three scripts is the same; only the size of the .rda file I am loading differs. The data structure is also identical for all three.

  • Dataset 1: size 493.3 KB = error
  • Dataset 2: size 630.6 KB = error
  • Dataset 3: size 300.2 KB = works!

Very strange.

My sessionInfo() output:

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] snowfall_1.84-6    snow_0.3-13        Snowball_0.0-11    RWekajars_3.7.11-1 rJava_0.9-6              RWeka_0.4-23      
[7] slam_0.1-32        SnowballC_0.5.1    tm_0.6             NLP_0.1-5          twitteR_1.1.8      devtools_1.6      

loaded via a namespace (and not attached):
[1] bit_1.1-12     bit64_0.9-4    grid_3.1.2     httr_0.5       parallel_3.1.2 RCurl_1.95-4.3    rjson_0.2.14   stringr_0.6.2 
[9] tools_3.1.2

Solution

I just added this line after loading the data, and everything works now:

MyCorpus <- tm_map(MyCorpus,
                     content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')),
                     mc.cores=1)
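
The 'UTF-8-MAC' target in the line above appears to be specific to the iconv shipped with macOS, so the call may fail on other platforms. A hedged, more portable sketch of the same idea converts from UTF-8 to UTF-8 and lets sub = "byte" replace any invalid bytes with a visible hex marker. The function name to_valid_utf8 is made up for illustration, and this is a variant of the fix, not the original code from this answer:

```r
# Hypothetical helper: force a character vector to valid UTF-8.
# Bytes that cannot be converted are replaced by a hex marker
# instead of making iconv() return NA.
to_valid_utf8 <- function(x) {
  iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
}

# A string containing a lone Latin-1 byte (0xE9), which is invalid UTF-8.
dirty <- c("plain ascii text", "caf\xe9 au lait")
clean <- to_valid_utf8(dirty)

print(validUTF8(dirty))
print(validUTF8(clean))

# After cleaning, case folding no longer trips over the invalid byte.
print(tolower(clean))
```

Inside a tm pipeline this could presumably be wrapped as content_transformer(to_valid_utf8) and passed to tm_map in the same position as the line above.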

Found the hint here: http://davetang.org/muse/2013/04/06/using-the-r_twitter-package/ (The author updated his code on November 26, 2014 because of this error.)
