ValueError: too many values to unpack (NLTK classifier)


Question

I'm doing classification analysis using NLTK's Naive Bayes classifier. My input is a TSV file containing records and labels.

But training fails with an error. Here's my Python code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('tweets.txt', delimiter ='\t', quoting = 3)

dataset.isnull().any()

dataset = dataset.fillna(method='ffill')

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0,16004):
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    ps = PorterStemmer()
    tweet = [ps.stem(word) for word in tweet if not word in 
    set(stopwords.words('english'))]
    tweet = ' '.join(tweet)
    corpus.append(tweet)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values




from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, 
random_state = 0)
train_set, test_set = X_train[500:], y_train[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)

The error is:

File "C:\Users\HSR\Anaconda2\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
for featureset, label in labeled_featuresets:

ValueError: too many values to unpack

Recommended answer

NLTK classifiers don't work like scikit-learn estimators. They require X and y together in a single sequence, which is then passed to train().

But in your code, you are supplying only X_train. train() tries to unpack a (featureset, label) pair from each row of it, and since each row holds far more than two values, you get the error.
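The failure can be reproduced without NLTK at all. The following is a minimal sketch, assuming a made-up five-element row standing in for one row of X_train; it mirrors the unpacking in the loop header shown in the traceback and is not NLTK's actual code.

```python
# Stand-in for one row of X_train: a plain sequence of word counts.
# (Hypothetical data; the real rows have many more entries.)
row = [0, 1, 0, 2, 0]

try:
    # Same shape of unpacking as the loop in the traceback:
    #     for featureset, label in labeled_featuresets:
    featureset, label = row
except ValueError as err:
    message = str(err)
    print(message)

# A (feature dict, label) pair unpacks cleanly into the two names.
featureset, label = ({"good": 1, "bad": 0}, "positive")
print(featureset, label)
```

Any row with more than two elements fails the same way, which is why a raw count matrix cannot be passed to train() directly.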

NaiveBayesClassifier requires its input to be a list of tuples: each list element is one training sample, and each tuple holds that sample's feature dictionary and its label. Something like:

X = [({feature1:'val11', feature2:'val12' .... }, class1),
     ({feature1:'val21', feature2:'val22' .... }, class2), 
     ...
     ...                                                  ]

You need to change your input to this format.

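# Column names of X, in order. Newer scikit-learn releases expose this as
# cv.get_feature_names_out() instead.
feature_names = cv.get_feature_names()
train_set = []
for i, single_sample in enumerate(X):
    # Map each feature name to its count for this sample.
    single_feature_dict = {}
    for j, single_feature in enumerate(single_sample):
        single_feature_dict[feature_names[j]] = single_feature
    # NLTK expects one (feature dict, label) tuple per sample.
    train_set.append((single_feature_dict, y[i]))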

Note: The for loop above can be shortened with a dict comprehension, but I'm not that fluent with those.
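For reference, one way the comprehension form might look. This is a sketch under assumptions: the helper name `to_labeled_featuresets` is hypothetical, and the small lists below are made-up stand-ins for the feature names, X and y from the code above.

```python
def to_labeled_featuresets(feature_names, rows, labels):
    """Pair each row's {feature name: value} dict with its label."""
    return [
        ({name: value for name, value in zip(feature_names, row)}, label)
        for row, label in zip(rows, labels)
    ]

# Toy stand-ins for the vectorizer's feature names, X and y (illustration only).
feature_names = ["good", "bad", "movie"]
X = [[1, 0, 1], [0, 2, 1]]
y = ["pos", "neg"]

train_set = to_labeled_featuresets(feature_names, X, y)
print(train_set)
```

The result has the same shape as the list built by the nested loops: one (feature dict, label) tuple per sample.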

Then you can do this:

nltk.NaiveBayesClassifier.train(train_set)
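If a held-out portion is wanted, as the slicing in the question suggests, one option is to split the list of (feature dict, label) tuples itself so that features and labels stay paired. A minimal sketch, assuming a hypothetical helper `split_labeled` and an arbitrary 80/20 ratio; the data is made up.

```python
def split_labeled(labeled_featuresets, train_fraction=0.8):
    """Split a list of (feature dict, label) tuples into two parts."""
    cut = int(len(labeled_featuresets) * train_fraction)
    return labeled_featuresets[:cut], labeled_featuresets[cut:]

# Made-up labeled samples in the format NaiveBayesClassifier expects.
labeled = [({"w": i % 2}, "pos" if i % 2 else "neg") for i in range(10)]

train_part, test_part = split_labeled(labeled)
print(len(train_part), len(test_part))

# train_part can then be passed to nltk.NaiveBayesClassifier.train(), and
# test_part to nltk.classify.accuracy(classifier, test_part).
```

The split is purely positional, so shuffle the list first if the samples are ordered by label.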
