DocumentTermMatrix counts incorrectly when using a dictionary

Problem description

I am trying to do sentiment analysis on Twitter data using the naive Bayes algorithm.

I am working with 2,000 tweets.

After loading the data into RStudio, I split and preprocess the data as follows:

train_size = floor(0.75 * nrow(Tweets_Model_Input))
set.seed(123)
train_sub = sample(seq_len(nrow(Tweets_Model_Input)), size = train_size)

Tweets_Model_Input_Train = Tweets_Model_Input[train_sub, ]
Tweets_Model_Input_Test = Tweets_Model_Input[-train_sub, ]

myCorpus = Corpus(VectorSource(Tweets_Model_Input_Train$SentimentText))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) #removes common prepositions and conjunctions 
myCorpus <- tm_map(myCorpus, stripWhitespace)
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
removeRetweet <- function(x) gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)
myCorpus <- tm_map(myCorpus, removeRetweet)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, PlainTextDocument)
myCorpus.train <- tm_map(myCorpus, stemDocument, language = "english")  
myCorpus.train <- Corpus(VectorSource(myCorpus.train$content))


myCorpus = Corpus(VectorSource(Tweets_Model_Input_Test$SentimentText))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english")) #removes common prepositions and conjunctions 
myCorpus <- tm_map(myCorpus, stripWhitespace)
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
removeRetweet <- function(x) gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)
myCorpus <- tm_map(myCorpus, removeRetweet)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, PlainTextDocument)
myCorpus.test <- tm_map(myCorpus, stemDocument, language = "english") 
myCorpus.test <- Corpus(VectorSource(myCorpus.test$content))

This gives me a training corpus and a test corpus for my naive Bayes classifier. Next, I want to create two DTMs restricted to the terms that appear at least 50 times in the training corpus. These terms are: "get", "miss", "day", "just", "now", "want", "good", "work".

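# Assumed step (not shown in the original post): an initial DTM over all terms,
# needed so that findFreqTerms() has something to scan.
dtm.train <- DocumentTermMatrix(myCorpus.train)
fivefreq = findFreqTerms(dtm.train, lowfreq = 50, highfreq = Inf)
length(fivefreq)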

dtm.train <- DocumentTermMatrix(myCorpus.train, control=list(dictionary = fivefreq))
dtm.test <- DocumentTermMatrix(myCorpus.test, control=list(dictionary = fivefreq))

For dtm.train this works well, but for dtm.test it does not work at all. The DTM has the columns selected above, but the counts in the matrix itself are wrong.

Tweet no. 1 of the training corpus is "omg celli happen yearswtf gota get bill paid". Its row in the DTM is correct.

Tweet no. 3 of the test corpus is "huge roll thunder just nowso scari". Its row in the DTM is not correct.

There is no "get" in that tweet, but there is a "just". So the count itself is right, but it lands in the wrong column.
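As a sanity check, the expected counts for the two tweets quoted above can be computed without tm at all. The following is a minimal base R sketch; `count_terms` is a hypothetical helper introduced here for illustration, and it assumes the already-preprocessed tweets can simply be split on whitespace.

```r
# Dictionary terms listed in the question.
dict <- c("get", "miss", "day", "just", "now", "want", "good", "work")

# The two preprocessed tweets quoted in the question.
tweets <- c("omg celli happen yearswtf gota get bill paid",
            "huge roll thunder just nowso scari")

# Hypothetical helper: count how often each dictionary term occurs in one text.
count_terms <- function(text, dictionary) {
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  # Tokens outside the dictionary become NA and are dropped by table().
  counts <- table(factor(tokens, levels = dictionary))
  setNames(as.integer(counts), dictionary)
}

# One row per tweet, one column per dictionary term.
expected <- do.call(rbind, lapply(tweets, count_terms, dictionary = dict))
print(expected)
```

Comparing `expected` column by column against `as.matrix(dtm.test)` for the same documents would show whether a count has been shifted into the wrong column.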

I have tried a lot to solve this problem, but I have run out of ideas. It looks to me as if tm builds the DTM from the terms of the specific corpus and then uses the dictionary only to relabel the columns, without actually matching against it.

Thanks for your help!

Recommended answer

This is an actual bug. Using VCorpus() instead of Corpus() will also fix the problem.
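A minimal sketch of the VCorpus() workaround, assuming the tm package is installed. The two documents are the preprocessed tweets quoted in the question, and the variable names `docs`, `dict`, `corp`, `dtm` and `m` are chosen here for illustration only.

```r
library(tm)

# Dictionary terms and sample tweets taken from the question.
dict <- c("get", "miss", "day", "just", "now", "want", "good", "work")
docs <- c("omg celli happen yearswtf gota get bill paid",
          "huge roll thunder just nowso scari")

# Build the corpus with VCorpus() rather than Corpus().
corp <- VCorpus(VectorSource(docs))

# Restrict the matrix to the dictionary terms.
dtm <- DocumentTermMatrix(corp, control = list(dictionary = dict))

# Look counts up by column name rather than by position.
m <- as.matrix(dtm)
print(m[, c("get", "just")])
```

Indexing by column name, as in the last line, avoids relying on the order in which tm arranges the dictionary terms.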

This seems to be an actual bug. Try reverting to tm version 0.6-2. That fixed the problem for me.
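Before downgrading, it may help to check which tm version is actually installed. The sketch below only compares version strings and prints a suggested command. `needs_downgrade` is a hypothetical helper, and using `remotes::install_version()` for the downgrade is an assumption; the answer does not say how the older version was installed.

```r
# Version reported as working in the answer above.
target <- "0.6-2"

# Hypothetical helper: TRUE when the installed version differs from the target.
needs_downgrade <- function(installed, target) {
  package_version(installed) != package_version(target)
}

# Only inspect the installed version; nothing is installed or removed here.
if (requireNamespace("tm", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("tm"))
  if (needs_downgrade(installed, target)) {
    message("Installed tm is ", installed, "; one way to pin it: ",
            "remotes::install_version(\"tm\", version = \"", target, "\")")
  }
}
```

Restart the R session after changing the installed version so that the older package is the one that gets loaded.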
