nltk NaiveBayesClassifier training for sentiment analysis


Problem description

I am training the NaiveBayesClassifier in Python using sentences, and it gives me the error below. I do not understand what the error might be, and any help would be good.

I have tried many other input formats, but the error remains. The code is given below:

from text.classifiers import NaiveBayesClassifier
from text.blob import TextBlob
train = [('I love this sandwich.', 'pos'),
         ('This is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('This is my best work.', 'pos'),
         ("What an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('He is my sworn enemy!', 'neg'),
         ('My boss is horrible.', 'neg') ]

test = [('The beer was good.', 'pos'),
        ('I do not enjoy my job', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ("I feel amazing!", 'pos'),
        ('Gary is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg') ]
classifier = nltk.NaiveBayesClassifier.train(train)

I am including the traceback below.

Traceback (most recent call last):
  File "C:\Users\5460\Desktop\train01.py", line 15, in <module>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Users\5460\Desktop\train01.py", line 15, in <genexpr>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 87, in word_tokenize
    return _word_tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\treebank.py", line 67, in tokenize
    text = re.sub(r'^"', r'``', text)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

Solution

You need to change your data structure. Here is your train list as it currently stands:

>>> train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

The problem is that the first element of each tuple should be a dictionary of features, not a raw string. So I will change your list into a data structure that the classifier can work with:

>>> from nltk.tokenize import word_tokenize # or use some other tokenizer
>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]
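One detail worth checking in the comprehension above: all_words holds lowercased tokens, while the membership test tokenizes x[0] without lowercasing, so a sentence that starts with "I" or "This" will not match 'i' or 'this'. A hedged variant that lowercases in both places and tokenizes each sentence only once might look like the sketch below. Here `tokenize` is a hypothetical regex stand-in for word_tokenize (an assumption, chosen so the sketch runs without NLTK data), and it splits contractions differently from the real tokenizer.

```python
import re

train = [('I love this sandwich.', 'pos'),
         ('This is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('This is my best work.', 'pos'),
         ("What an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('He is my sworn enemy!', 'neg'),
         ('My boss is horrible.', 'neg')]


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercased word runs and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


# Tokenize every sentence once and keep the tokens as a set for fast lookups.
tokenized = [(set(tokenize(sentence)), label) for sentence, label in train]

# The vocabulary is the union of all token sets.
all_words = set().union(*(tokens for tokens, _ in tokenized))

# One {word: bool} feature dictionary per sentence, paired with its label.
t = [({word: word in tokens for word in all_words}, label)
     for tokens, label in tokenized]

print(t[0][0]['i'], t[0][0]['this'], t[0][1])
```

With this variant 'i' is True for the first sentence. The outputs shown in the rest of this answer come from the original, case-sensitive comprehension.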

Your data should now be structured like this:

>>> t
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), . . .]

Note that the first element of each tuple is now a dictionary. With the data in this form, you can train the classifier like so:

>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train(t)
>>> classifier.show_most_informative_features()
Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                 awesome = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                   place = False             neg : pos    =      1.2 : 1.0
                horrible = False             pos : neg    =      1.2 : 1.0
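The ratios in this table can be reproduced by hand. The sketch below assumes the classifier estimates each P(feature = value | label) with expected-likelihood smoothing, that is, adding 0.5 to every count, which I believe is NLTK's default estimator. `ele_prob` is a hypothetical helper, not an NLTK function. The counts are read off the train list: with the case-sensitive features above, the token 'this' is present in 3 of the 5 neg sentences and in 1 of the 5 pos sentences.

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature = value | label).

    Adds `gamma` to the count of each of the `bins` possible values
    (here the two values True and False).
    """
    return (count + gamma) / (total + gamma * bins)


sentences_per_label = 5

# 'this' is True in 3 neg sentences and 1 pos sentence.
ratio_true = ele_prob(3, sentences_per_label) / ele_prob(1, sentences_per_label)

# 'this' is False in the remaining 4 pos sentences and 2 neg sentences.
ratio_false = ele_prob(4, sentences_per_label) / ele_prob(2, sentences_per_label)

print(f"this = True    neg : pos = {ratio_true:.1f} : 1.0")
print(f"this = False   pos : neg = {ratio_false:.1f} : 1.0")
```

Both values agree with the first two rows of the table (2.3 and 1.8). With only five sentences per label, the smoothing term has a large influence on every ratio.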

If you want to use the classifier, you can do it like this. First, you begin with a test sentence:

>>> test_sentence = "This is the best band I've ever heard!"

Then, you tokenize the sentence and figure out which words the sentence shares with all_words. These constitute the sentence's features.

>>> test_sent_features = {word: (word in word_tokenize(test_sentence.lower())) for word in all_words}

Your features will now look like this:

>>> test_sent_features
{'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True, 'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True, '!': True, 'what': False, 'i': True, '.': False, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False, 'view': False}
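Only words that already appear in all_words can become features, so unseen words such as 'band' or 'heard' contribute nothing to the decision. A small sketch of that overlap follows. The vocabulary is copied from the dictionary keys above, and a regular expression stands in for word_tokenize (an assumption; the real tokenizer splits "I've" differently).

```python
import re

# Vocabulary copied from the keys of test_sent_features shown above.
vocabulary = set(
    "love deal tired feel is am an sandwich ca best ! what i . amazing "
    "horrible sworn awesome do good very boss beers not with he enemy "
    "about like restaurant this of work n't these stuff place my view".split()
)

test_sentence = "This is the best band I've ever heard!"

# Stand-in tokenizer (assumption): lowercased word runs and single punctuation marks.
tokens = set(re.findall(r"\w+|[^\w\s]", test_sentence.lower()))

shared = sorted(tokens & vocabulary)   # words that become True features
ignored = sorted(tokens - vocabulary)  # words the classifier never sees

print(shared)
print(ignored)
```

The shared words are exactly the entries that are True in the dictionary above.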

Then you simply classify those features:

>>> classifier.classify(test_sent_features)
'pos' # note 'best' == True in the sentence features above

This test sentence appears to be positive.
