NLTK sentiment analysis is only returning one value


Question

I seriously hate to post a question about an entire chunk of code, but I've been working on this for the past 3 hours and I can't wrap my head around what is happening. I have approximately 600 tweets that I am retrieving from a CSV file, with score values between -2 and 2 reflecting the sentiment towards a presidential candidate.

However, when I run this training sample on any other data, only one value is returned (positive). I have checked to see if the scores were being added correctly and they are. It just doesn't make sense to me that 85,000 tweets would all be rated "positive" from a diverse training set of 600. Does anyone know what is happening here? Thanks!

import ast
import csv

import nltk

tweets = []
with open('romney.csv', 'rb') as csvfile:
    mycsv = csv.reader(csvfile)
    for row in mycsv:
        tweet = row[1]
        try:
            score = ast.literal_eval(row[12])
            if score > 0:
                print score
                print tweet
                tweets.append((tweet,"positive"))
            elif score < 0:
                print score
                print tweet
                tweets.append((tweet,"negative"))
        except ValueError:
            tweet = ""
def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

word_features = get_word_features(get_words_in_tweets(tweets))
training_set = nltk.classify.apply_features(extract_features, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)
c = 0
with open('usa.csv', "rU") as csvfile:
    mycsv = csv.reader(csvfile)
    for row in mycsv:
        try:
            tweet = row[0]
            c = c + 1
            print classifier.classify(extract_features(tweet.split()))
        except IndexError:
            tweet = ""

Recommended answer

A Naive Bayes classifier usually works best when it evaluates the words that appear in a document and ignores the absence of words. Since you use

features['contains(%s)' % word] = (word in document_words)

each document is mostly represented by features whose value is False.
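
To make that concrete, here is a minimal sketch of the question's extract_features run against a tiny vocabulary. The word list and the sample tweet are made up for illustration; the real word_features would contain every distinct token from the training tweets, so the imbalance would be far larger.

```python
# Hypothetical vocabulary, for illustration only. In the question it is
# built from the training tweets via get_word_features().
word_features = ["economy", "jobs", "taxes", "great",
                 "terrible", "vote", "debate", "plan"]

def extract_features(document):
    # Same logic as in the question: one feature per vocabulary word,
    # set to True or False depending on whether the word is present.
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

# A made-up tweet, tokenised on whitespace.
features = extract_features("great debate tonight".split())

true_count = sum(1 for value in features.values() if value)
false_count = len(features) - true_count
print("%d True, %d False" % (true_count, false_count))
```

Even with only eight vocabulary words, most of the features are False. With a vocabulary of thousands of words and tweets of a dozen tokens, nearly every feature describes an absent word.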

Try something like the following:

if word in document_words:
    features['contains(%s)' % word] = True

(You should probably also replace the for loop with something more efficient: rather than looping over every word in the lexicon, loop only over the words that occur in the document.)
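
One possible way to combine both suggestions is sketched below. It is not the asker's code: the vocabulary parameter and the sample data are assumptions made for illustration, and the sketch assumes that document is a list of word tokens (for example the result of tweet.split()) both when training and when classifying.

```python
def extract_features(document, vocabulary):
    """Return features only for vocabulary words present in the document.

    `document` is assumed to be a list of word tokens and `vocabulary`
    a set of known words (a hypothetical stand-in for word_features).
    """
    features = {}
    # Loop over the words of the document, not over the whole lexicon.
    for word in set(document):
        if word in vocabulary:
            features['contains(%s)' % word] = True
    return features

# Made-up data for illustration only.
vocabulary = set(["economy", "jobs", "great", "terrible"])
features = extract_features("great jobs plan".split(), vocabulary)
```

Here features holds two entries, contains(great) and contains(jobs), both True, and nothing for the absent words. Storing the vocabulary as a set keeps each membership test cheap, so the cost per tweet depends on the length of the tweet and not on the size of the lexicon.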
