R sentiment analysis with phrases in dictionaries


Question

I am performing sentiment analysis on a set of tweets, and I now want to know how to add phrases to the positive and negative dictionaries.

I've read in the files of phrases I want to test, but running the sentiment analysis doesn't give me a result.

Reading through the sentiment algorithm, I can see that it matches individual words against the dictionaries. Is there a way to scan for phrases as well as words?

Here is the code:

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)  
  require(stringr)  
  # we got a vector of sentences. plyr will handle a list  
  # or a vector as an "l" for us  
  # we want a simple array ("a") of scores back, so we use  
  # "l" + "a" + "ply" = "laply":  
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)    
    # and convert to lower case:    
    sentence = tolower(sentence)    
    # split into words. str_split is in the stringr package    
    word.list = str_split(sentence, '\\s+')    
    # sometimes a list() is one level of hierarchy too much    
    words = unlist(word.list)    
    # compare our words to the dictionaries of positive & negative terms
    # (use the function arguments, not global variables named pos/neg)
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA    
    # we just want a TRUE/FALSE:    
    pos.matches = !is.na(pos.matches)   
    neg.matches = !is.na(neg.matches)   
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)    
    return(score)    
  }, pos.words, neg.words, .progress=.progress )  
  scores.df = data.frame(score=scores, text=sentences)  
  return(scores.df)  
}
analysis=score.sentiment(Tweets, pos, neg)
table(analysis$score)

This is the result I get:

0
20

whereas I am after the standard table that this function provides, for example:

-2 -1 0 1 2 
 1  2 3 4 5 

Does anybody have any ideas on how to run this on phrases? Note: the Tweets file is a file of sentences.

Recommended answer

The function score.sentiment seems to work. If I try a very simple setup,

Tweets = c("this is good", "how bad it is")
neg = c("bad")
pos = c("good")
analysis=score.sentiment(Tweets, pos, neg)
table(analysis$score)

I get the expected result,

> table(analysis$score)

-1  1 
 1  1 

How are you feeding the 20 tweets to the method? From the result you're posting, that 0 20, I'd say your problem is that none of your 20 tweets contains a positive or negative word, although of course if that were the case you would have noticed it. Maybe if you post more details on your list of tweets and your positive and negative words, it would be easier to help you.

Anyhow, your function seems to be working just fine.

Hope it helps.
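To make the symptom concrete, here is a minimal base R sketch of why a multi-word dictionary entry never matches when the sentence is split on whitespace. The sentence and the dictionary below are made up for illustration and are not taken from the question's data.

```r
# Hypothetical inputs, for illustration only.
sentence <- "ed miliband is offering wrong choice of more cuts"
neg.words <- c("bad", "wrong choice")

# Same tokenization step as in score.sentiment: split on whitespace.
words <- unlist(strsplit(sentence, "\\s+"))

# Every token is a single word, so none of them can equal the
# two-word entry "wrong choice"; match() returns NA for all tokens.
neg.matches <- !is.na(match(words, neg.words))
sum(neg.matches)

# The phrase is nevertheless present in the raw string.
grepl("wrong choice", sentence, fixed = TRUE)
```

If every dictionary entry is a phrase, each tweet scores 0, which would be consistent with a table showing only 0 with a count of 20.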

Actually, to solve your problem you need to tokenize your sentences into n-grams, where n corresponds to the maximum number of words in any entry of your positive and negative n-gram lists. You can see how to do this e.g. in this SO question. For completeness, and since I've tested it myself, here is an example of what you could do. I simplify it to bigrams (n=2) and use the following inputs:

Tweets = c("rewarding hard work with raising taxes and VAT. #LabourManifesto", 
              "Ed Miliband is offering 'wrong choice' of 'more cuts' in #LabourManifesto")
pos = c("rewarding hard work")
neg = c("wrong choice")

You can create a bigram tokenizer like this,

library(tm)
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2,max=2))

and test it,

> BigramTokenizer("rewarding hard work with raising taxes and VAT. #LabourManifesto")
[1] "rewarding hard"       "hard work"            "work with"           
[4] "with raising"         "raising taxes"        "taxes and"           
[7] "and VAT"              "VAT #LabourManifesto"

Then in your method you simply replace this line,

word.list = str_split(sentence, '\\s+')

with this,

word.list = BigramTokenizer(sentence)

Of course, it would be better to rename word.list to ngram.list or something like that.

The result is as expected,

> table(analysis$score)

-1  0 
 1  1

Just decide your n-gram size, set it in Weka_control, and you should be fine.

Hope it helps.
