R LDA Topic Modeling: Result topics contain very similar words


Question

All:

I am a beginner in R topic modeling; it all started three weeks ago. I can successfully process my data into a corpus, a document-term matrix, and the LDA function. My input is about 460,000 tweets. But I am not happy with the result: the words across all topics are very similar.

packages <- c('tm','topicmodels','SnowballC','RWeka','rJava')
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
install.packages(setdiff(packages, rownames(installed.packages())))  
}

options( java.parameters = "-Xmx4g" )
library(tm)
library(topicmodels)
library(SnowballC)
library(RWeka)

print("Please select the input file");
flush.console();
ifilename <- file.choose();
raw_data=scan(ifilename,'string',sep="\n",skip=1);

tweet_data=raw_data;
rm(raw_data);
tweet_data = gsub("(RT|via)((?:\\b\\W*@\\w+)+)","",tweet_data)
tweet_data = gsub("http[^[:blank:]]+", "", tweet_data)
tweet_data = gsub("@\\w+", "", tweet_data)
tweet_data = gsub("[ \t]{2,}", "", tweet_data)
tweet_data = gsub("^\\s+|\\s+$", "", tweet_data)
tweet_data = gsub('\\d+', '', tweet_data)
tweet_data = gsub("[[:punct:]]", " ", tweet_data)

max_size=5000;
data_size=length(tweet_data);
itinerary=ceiling(data_size[1]/max_size);
if (itinerary==1){pre_data=tweet_data}else {pre_data=tweet_data[1:max_size]}

corp <- Corpus(VectorSource(pre_data));
corp<-tm_map(corp,tolower);
corp<-tm_map(corp,removePunctuation);
extend_stop_word=c('description:','null','text:','description','url','text','aca',
                   'obama','romney','ryan','mitt','conservative','liberal');
corp<-tm_map(corp,removeNumbers);
gc();
IteratedLovinsStemmer(corp, control = NULL)
gc();
corp<-tm_map(corp,removeWords,c(stopwords('english'),extend_stop_word));
gc();
corp <- tm_map(corp, PlainTextDocument)
gc();
dtm.control = list(tolower= F,removePunctuation=F,removeNumbers= F,
                   stemming= F, minWordLength = 3,weighting= weightTf,stopwords=F)

dtm = DocumentTermMatrix(corp, control=dtm.control)
gc();
#dtm = removeSparseTerms(dtm,0.99)
dtm = dtm[rowSums(as.matrix(dtm))>0,]
gc();

best.model <- lapply(seq(2,50, by=2), function(k){LDA(dtm[1:10,], k)})
gc();
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
best.model.logLik.df <- data.frame(topics=c(seq(2,50, by=2)), LL=as.numeric(as.matrix(best.model.logLik)))
k=best.model.logLik.df[which.max(best.model.logLik.df$LL),1];
cat("Best topic number is k=",k);
flush.console();
gc();
lda.model = LDA(dtm, k,method='VEM')
gc();
write.csv(terms(lda.model,50), file = "terms.csv");
lda_topics=topics(lda.model,1);

Here is the result I got:

> terms(lda.model,10)
      Topic 1     Topic 2    Topic 3    Topic 4    Topic 5   
 [1,] "taxes"     "medicare" "tax"      "tax"      "jobs"    
 [2,] "pay"       "will"     "returns"  "returns"  "plan"    
 [3,] "welfare"   "tax"      "gop"      "taxes"    "gop"     
 [4,] "will"      "care"     "taxes"    "will"     "military"
 [5,] "returns"   "can"      "abortion" "paul"     "will"    
 [6,] "plan"      "laden"    "can"      "medicare" "tax"     
 [7,] "economy"   "vote"     "tcot"     "class"    "paul"    
 [8,] "budget"    "economy"  "muslim"   "budget"   "campaign"
 [9,] "president" "taxes"    "campaign" "says"     "says"    
[10,] "reid"      "just"     "economy"  "cuts"     "can"     
      Topic 6     Topic 7     Topic 8    Topic 9    
 [1,] "medicare"  "tax"       "medicare" "tax"      
 [2,] "taxes"     "medicare"  "tax"      "president"
 [3,] "plan"      "taxes"     "jobs"     "jobs"     
 [4,] "tcot"      "tcot"      "tcot"     "taxes"    
 [5,] "budget"    "president" "foreign"  "medicare" 
 [6,] "returns"   "jobs"      "plan"     "tcot"     
 [7,] "welfare"   "budget"    "will"     "paul"     
 [8,] "can"       "energy"    "economy"  "health"   
 [9,] "says"      "military"  "bush"     "people"   
[10,] "obamacare" "want"      "now"      "gop"      
      Topic 10    Topic 11   Topic 12  
 [1,] "tax"       "gop"      "gop"     
 [2,] "medicare"  "tcot"     "plan"    
 [3,] "tcot"      "military" "tax"     
 [4,] "president" "jobs"     "taxes"   
 [5,] "gop"       "energy"   "welfare" 
 [6,] "plan"      "will"     "tcot"    
 [7,] "jobs"      "ohio"     "military"
 [8,] "will"      "abortion" "campaign"
 [9,] "cuts"      "paul"     "class"   
[10,] "paul"      "budget"   "just" 

As you can see, the words "tax" and "medicare" appear across all topics. I noticed that when I play with dtm = removeSparseTerms(dtm,0.99), the results change a little. The following is a sample of my input data:

> tweet_data[1:10]
 [1] " While  Romney friends get richer  MT  Romney Ryan Economic Plans Would Increase Unemployment Deepen Recession"                 
 [2] "Wayne Allyn Root claims proof of Obama s foreign citizenship  During a radio show interview Resist"                             
 [3] " President Obama  Chief Investor  Leave Energy Upgrades to the Businesses  Reading President Obama誷 latest Execu   "           
 [4] " Brotherhood  starts crucifixions Opponents of Egypt s Muslim president executed  naked on trees   Obama s    tcot"             
 [5] "  Say you stand with President Obama裻he candidate in this election who trusts women to make their own health decisions     "   
 [6] " Romney  Ryan Descend Into Medicare Gibberish "                                                                                 
 [7] "Maddow  Romney demanded opponents tax returns and lied about residency in    The Raw Story"                                     
 [8] "Is it not grand  How can Jews reconcile Obama   Carter s treatment of Jews Israel  How ca    "                                  
 [9] "   The Tax Returns are Hurting Romney  Badly  "                                                                                 
[10] "  Replacing Gen Dempsey is cruicial to US security  Dempsey  disappointed  by anti Obama campaign by ex military members  h    "

Please help! Thanks!

Recommended answer

Reduce the number of topics in your case. This should improve the clustering ability of your topic model; right now your topics overlap one another. Because the topic index varies between runs, it is also difficult to follow the results or compare them.
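One rough way to see whether a smaller number of topics helps is to measure how much the top-term lists of the topics overlap. The sketch below is only an illustration and not part of the original answer: `topic_overlap` is a hypothetical helper name, the Jaccard measure is an assumed choice of metric, and the toy matrix is hand-built from words in the output above rather than taken from a fitted model.

```r
# Mean pairwise Jaccard overlap between the top-term lists of all topics.
# `top_terms` is a character matrix with one column per topic, which is the
# shape that terms(model, n) from topicmodels returns when n > 1.
topic_overlap <- function(top_terms) {
  k <- ncol(top_terms)
  if (k < 2) return(0)
  pairs <- combn(k, 2)  # every pair of topic columns
  jaccard <- apply(pairs, 2, function(p) {
    a <- top_terms[, p[1]]
    b <- top_terms[, p[2]]
    length(intersect(a, b)) / length(union(a, b))
  })
  mean(jaccard)
}

# Toy input (hypothetical, for illustration only).
toy <- cbind(
  "Topic 1" = c("tax", "medicare", "jobs"),
  "Topic 2" = c("tax", "medicare", "plan"),
  "Topic 3" = c("energy", "ohio", "abortion")
)
overlap <- topic_overlap(toy)
print(overlap)

# With the question's dtm, one could compare a few smaller values of k under
# a fixed seed so that runs are repeatable (the values of k are assumptions):
# for (k in c(3, 5, 8)) {
#   m <- topicmodels::LDA(dtm, k, method = "VEM", control = list(seed = 1))
#   cat(k, topic_overlap(topicmodels::terms(m, 10)), "\n")
# }
```

A lower value means the topics share fewer top words. Fixing the seed does not make topic indices comparable across different k, but it does make a single configuration reproducible from run to run.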
