n-grams with Naive Bayes classifier


Question


I'm new to Python and need help! I was practicing Python NLTK text classification. Here is the code example I am practicing on: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

I tried this:

from nltk import bigrams
from nltk.probability import ELEProbDist, FreqDist
from nltk import NaiveBayesClassifier
from collections import defaultdict

train_samples = {}

with file ('positive.txt', 'rt') as f:
   for line in f.readlines():
       train_samples[line]='pos'

with file ('negative.txt', 'rt') as d:
   for line in d.readlines():
       train_samples[line]='neg'

f=open("test.txt", "r")
test_samples=f.readlines()

def bigramReturner(text):
    tweetString = text.lower()
    bigramFeatureVector = {}
    for item in bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector

def get_labeled_features(samples):
    word_freqs = {}
    for text, label in train_samples.items():
        tokens = text.split()
        for token in tokens:
            if token not in word_freqs:
                word_freqs[token] = {'pos': 0, 'neg': 0}
            word_freqs[token][label] += 1
    return word_freqs


def get_label_probdist(labeled_features):
    label_fd = FreqDist()
    for item,counts in labeled_features.items():
        for label in ['neg','pos']:
            if counts[label] > 0:
                label_fd.inc(label)
    label_probdist = ELEProbDist(label_fd)
    return label_probdist


def get_feature_probdist(labeled_features):
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    num_samples = len(train_samples) / 2
    for token, counts in labeled_features.items():
        for label in ['neg','pos']:
            feature_freqdist[label, token].inc(True, count=counts[label])
            feature_freqdist[label, token].inc(None, num_samples - counts[label])
            feature_values[token].add(None)
            feature_values[token].add(True)
    for item in feature_freqdist.items():
        print item[0],item[1]
    feature_probdist = {}
    for ((label, fname), freqdist) in feature_freqdist.items():
        probdist = ELEProbDist(freqdist, bins=len(feature_values[fname]))
        feature_probdist[label,fname] = probdist
    return feature_probdist



labeled_features = get_labeled_features(train_samples)

label_probdist = get_label_probdist(labeled_features)

feature_probdist = get_feature_probdist(labeled_features)

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)

for sample in test_samples:
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))


But I am getting this error. Why?

Traceback (most recent call last):
  File "C:\python\naive_test.py", line 76, in <module>
    print "%s | %s" % (sample, classifier.classify(bigramReturner(sample)))
  File "C:\python\naive_test.py", line 23, in bigramReturner
    bigramFeatureVector.append(' '.join(item))
AttributeError: 'dict' object has no attribute 'append'

Answer


A bigram feature vector follows exactly the same principles as a unigram feature vector. So, just like in the tutorial you mentioned, you will have to check whether each bigram feature is present in any of the documents you use.
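For instance, once you have the list of bigrams for a document, you still need to convert it into the dict-of-booleans featureset that NaiveBayesClassifier.classify expects (your traceback comes from initializing bigramFeatureVector as a dict and then calling append on it). Here is a minimal sketch in the style of the tutorial's extract_features; the names bigram_features and known_bigrams are mine, with known_bigrams standing for the bigram vocabulary collected from your training data:

def bigram_features(bigram_list, known_bigrams):
    # Map every known bigram to True/False depending on whether it
    # occurs in this document's bigram list.
    present = set(bigram_list)
    features = {}
    for bigram in known_bigrams:
        features['contains(%s)' % bigram] = (bigram in present)
    return features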


As for the bigram features and how to extract them, I have written the code below. You can simply adapt it to change the variable "tweets" in the tutorial.

import nltk

text = "Hi, I want to get the bigram list of this string"
for item in nltk.bigrams(text.split()):
    print ' '.join(item)
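On this example string, the loop prints "Hi, I", "I want", "want to", and so on, one bigram per line; note that the comma stays attached to "Hi," because no punctuation has been removed at this point.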


Instead of printing them, you can simply append them to the "tweets" list and you are good to go! I hope this helps; otherwise, let me know if you still have problems.
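As a sketch of that adaptation, assuming the tutorial's pos_tweets and neg_tweets lists of (text, sentiment) pairs:

import nltk

tweets = []
for (text, sentiment) in pos_tweets + neg_tweets:
    # Use joined bigrams in place of the tutorial's word lists.
    bigram_list = [' '.join(pair) for pair in nltk.bigrams(text.lower().split())]
    tweets.append((bigram_list, sentiment))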


Please note that in applications like sentiment analysis, some researchers tend to tokenize the words and remove the punctuation while others don't. From experience I know that if you don't remove punctuation, Naive Bayes works almost the same, but an SVM will have a decreased accuracy rate. You might need to play around with this and decide what works better on your dataset.


There is a book named "Natural Language Processing with Python" which I can recommend to you. It contains examples of bigrams as well as some exercises. However, I think you can solve this case even without it. The idea behind selecting bigrams as features is that we want to know the probability that word A appears in our corpus followed by word B. So, for example, in the sentence


"I drive a truck"


the word-unigram features would be each of those 4 words, while the word-bigram features would be:


["I drive", "drive a", "a truck"]


Now you want to use those 3 as your features. So the function below puts all bigrams of a string into a list named bigramFeatureVector.

import nltk

def bigramReturner(tweetString):
    tweetString = tweetString.lower()
    # removePunctuation is your own function; see the note below.
    tweetString = removePunctuation(tweetString)
    bigramFeatureVector = []
    for item in nltk.bigrams(tweetString.split()):
        bigramFeatureVector.append(' '.join(item))
    return bigramFeatureVector


Note that you have to write your own removePunctuation function. What you get as the output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.
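For completeness, here is one possible removePunctuation; this is my own sketch rather than part of the original answer, using a regular expression to drop every character that is neither a word character nor whitespace:

import re

def removePunctuation(text):
    # Keep word characters and whitespace, drop everything else.
    return re.sub(r'[^\w\s]', '', text)

With that in place, bigramReturner("I drive a truck!") returns ['i drive', 'drive a', 'a truck'].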
