nltk NaiveBayesClassifier training for sentiment analysis
Problem description
I am training the NaiveBayesClassifier in Python using sentences, and it gives me the error below. I do not understand what the error might be, and any help would be appreciated.
I have tried many other input formats, but the error remains. The code is given below:
from text.classifiers import NaiveBayesClassifier
from text.blob import TextBlob
train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg') ]
test = [('The beer was good.', 'pos'),
('I do not enjoy my job', 'neg'),
("I ain't feeling dandy today.", 'neg'),
("I feel amazing!", 'pos'),
('Gary is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'neg') ]
classifier = nltk.NaiveBayesClassifier.train(train)
I am including the traceback below.
Traceback (most recent call last):
  File "C:\Users\5460\Desktop\train01.py", line 15, in <module>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Users\5460\Desktop\train01.py", line 15, in <genexpr>
    all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 87, in word_tokenize
    return _word_tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\treebank.py", line 67, in tokenize
    text = re.sub(r'^"', r'``', text)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
You need to change your data structure. Here is your train list as it currently stands:
>>> train = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]
The problem is that the first element of each tuple should be a dictionary of features, not a raw string. So I will change your list into a data structure that the classifier can work with:
>>> from nltk.tokenize import word_tokenize # or use some other tokenizer
>>> all_words = set(word.lower() for passage in train for word in word_tokenize(passage[0]))
>>> t = [({word: (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]
Your data should now be structured like this:
>>> t
[({'this': True, 'love': True, 'deal': False, 'tired': False, 'feel': False, 'is': False, 'am': False, 'an': False, 'sandwich': True, 'ca': False, 'best': False, '!': False, 'what': False, '.': True, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'these': False, 'of': False, 'work': False, "n't": False, 'i': False, 'stuff': False, 'place': False, 'my': False, 'view': False}, 'pos'), . . .]
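The comprehension above lowercases words when it builds all_words but not when it tests membership, which is why 'i' comes out False for "I love this sandwich." A minimal sketch of a helper that lowercases in both places, so training and test sentences go through the same code path, might look like the following. The names simple_tokenize, build_vocabulary and extract_features are hypothetical, and simple_tokenize is only a regex stand-in for word_tokenize so that the sketch runs without NLTK's tokenizer data; it does not split contractions the way word_tokenize does.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for nltk.tokenize.word_tokenize:
    # runs of letters/digits/apostrophes, or single punctuation marks.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def build_vocabulary(labeled_sentences, tokenize=simple_tokenize):
    # Every lowercased token seen anywhere in the training sentences.
    return {tok.lower() for sentence, _ in labeled_sentences for tok in tokenize(sentence)}

def extract_features(sentence, vocabulary, tokenize=simple_tokenize):
    # One boolean feature per vocabulary word: is it present in this sentence?
    tokens = {tok.lower() for tok in tokenize(sentence)}
    return {word: (word in tokens) for word in vocabulary}

train = [
    ('I love this sandwich.', 'pos'),
    ('I do not like this restaurant', 'neg'),
]
vocabulary = build_vocabulary(train)
featuresets = [(extract_features(s, vocabulary), label) for s, label in train]
```

With this version, featuresets has the same shape as t above (a list of (dict, label) tuples), and passing tokenize=word_tokenize swaps the NLTK tokenizer back in.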
Note that the first element of each tuple is now a dictionary. With the data in that form, you can train the classifier like so:
>>> import nltk
>>> classifier = nltk.NaiveBayesClassifier.train(t)
>>> classifier.show_most_informative_features()
Most Informative Features
                    this = True              neg : pos    =      2.3 : 1.0
                    this = False             pos : neg    =      1.8 : 1.0
                      an = False             neg : pos    =      1.6 : 1.0
                       . = True              pos : neg    =      1.4 : 1.0
                       . = False             neg : pos    =      1.4 : 1.0
                 awesome = False             neg : pos    =      1.2 : 1.0
                      of = False             pos : neg    =      1.2 : 1.0
                    feel = False             neg : pos    =      1.2 : 1.0
                   place = False             neg : pos    =      1.2 : 1.0
                horrible = False             pos : neg    =      1.2 : 1.0
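Each row is a ratio of smoothed likelihoods P(feature = value | label) between the two labels. As a rough check, the two rows for this can be reproduced by hand, assuming the classifier's default expected-likelihood estimate (add 0.5 to each count, with two possible values per feature). The counts below come from the training list: the lowercase token this appears in 1 of the 5 positive sentences and 3 of the 5 negative ones, since matching is case-sensitive in the comprehension used to build t. The helper ele_probability is an illustrative name, not an NLTK function.

```python
def ele_probability(count, total, bins=2, gamma=0.5):
    # Expected-likelihood estimate (assumed smoothing): add gamma to the
    # count and gamma * bins to the total.
    return (count + gamma) / (total + gamma * bins)

# Training sentences per label that contain the token 'this'
this_true = {'pos': 1, 'neg': 3}
per_label_total = 5

p_true = {label: ele_probability(n, per_label_total)
          for label, n in this_true.items()}
p_false = {label: ele_probability(per_label_total - n, per_label_total)
           for label, n in this_true.items()}

ratio_true = p_true['neg'] / p_true['pos']     # this = True,  neg : pos
ratio_false = p_false['pos'] / p_false['neg']  # this = False, pos : neg
print(round(ratio_true, 1), round(ratio_false, 1))  # prints: 2.3 1.8
```

Ratios this close to 1 on a ten-sentence training set mostly reflect noise, which is why function words and punctuation rank at the top.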
If you want to use the classifier, you can do it like this. First, you begin with a test sentence:
>>> test_sentence = "This is the best band I've ever heard!"
Then, you tokenize the sentence and figure out which words the sentence shares with all_words. These constitute the sentence's features.
>>> test_sent_features = {word: (word in word_tokenize(test_sentence.lower())) for word in all_words}
Your features will now look like this:
>>> test_sent_features
{'love': False, 'deal': False, 'tired': False, 'feel': False, 'is': True, 'am': False, 'an': False, 'sandwich': False, 'ca': False, 'best': True, '!': True, 'what': False, 'i': True, '.': False, 'amazing': False, 'horrible': False, 'sworn': False, 'awesome': False, 'do': False, 'good': False, 'very': False, 'boss': False, 'beers': False, 'not': False, 'with': False, 'he': False, 'enemy': False, 'about': False, 'like': False, 'restaurant': False, 'this': True, 'of': False, 'work': False, "n't": False, 'these': False, 'stuff': False, 'place': False, 'my': False, 'view': False}
Then you simply classify those features:
>>> classifier.classify(test_sent_features)
'pos' # note 'best' == True in the sentence features above
This test sentence appears to be positive.
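To see roughly how a label is chosen from such a feature dictionary, here is a simplified from-scratch sketch of the naive Bayes decision rule: log prior plus the sum of per-feature log likelihoods, highest score wins. It is not NLTK's implementation. The names train_naive_bayes and classify are hypothetical, the toy data is made up for illustration, the prior is left unsmoothed, and the 0.5 smoothing constant is an assumption.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(featuresets, gamma=0.5):
    # Count labels, and for each label how often each feature is True.
    label_counts = Counter(label for _, label in featuresets)
    true_counts = defaultdict(Counter)
    for features, label in featuresets:
        for name, value in features.items():
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts, gamma

def classify(model, features):
    label_counts, true_counts, gamma = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior (unsmoothed in this sketch)
        for name, value in features.items():
            # Smoothed P(feature is True | label); two possible values per feature.
            p_true = (true_counts[label][name] + gamma) / (n + 2 * gamma)
            score += math.log(p_true if value else 1.0 - p_true)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up toy data in the same (feature dict, label) shape
toy = [
    ({'best': True, 'horrible': False}, 'pos'),
    ({'best': True, 'horrible': False}, 'pos'),
    ({'best': False, 'horrible': True}, 'neg'),
    ({'best': False, 'horrible': True}, 'neg'),
]
model = train_naive_bayes(toy)
prediction = classify(model, {'best': True, 'horrible': False})
```

Every feature contributes to the score whether it is True or False, so the absence of negative words counts toward 'pos' as well as the presence of 'best'.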