Reshape pandas.Df to use in GridSearch
Problem description
I am trying to use multiple feature columns in GridSearch with Pipeline. So I pass two columns for which I want to do a TfidfVectorizer, but I get into trouble when running the GridSearch.
Xs = training_data.loc[:,['text','path_contents']]
y = training_data['class_recoded'].astype('int32')
for col in Xs:
    print(Xs[col].shape)
print(Xs.shape)
print(y.shape)
# (2464L,)
# (2464L,)
# (2464, 2)
# (2464L,)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([('vectorizer', TfidfVectorizer(encoding="cp1252", stop_words="english")),
                     ('nb', MultinomialNB())])
parameters = {
    'vectorizer__max_df': (0.48, 0.5, 0.52,),
    'vectorizer__max_features': (None, 8500, 9000, 9500),
    'vectorizer__ngram_range': ((1, 3), (1, 4), (1, 5)),
    'vectorizer__use_idf': (False, True)
}
if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2)
    grid_search.fit(Xs, y)  # <- error thrown here
    print("Best score: {0}".format(grid_search.best_score_))
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(list(parameters.keys())):
        print("\t{0}: {1}".format(param_name, best_parameters[param_name]))
Error: ValueError: Found input variables with inconsistent numbers of samples: [2, 1642]
I read about a similar error here and here, and I tried the suggestions from both questions, but to no avail.
I tried selecting my data in a different way:
features = ['text', 'path_contents']
Xs = training_data[features]
I also tried using .values instead, as suggested here, like so:
grid_search.fit(Xs.values, y.values)
But this gave me the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
So what's going on? I'm not sure how to continue from this.
Recommended answer
TfidfVectorizer expects its input to be a list of strings, one document per item. That explains "AttributeError: 'numpy.ndarray' object has no attribute 'lower'": you passed a 2-D array, which it treats as a list of arrays, so each "document" is a row array rather than a string.
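The following is a minimal sketch of that behaviour, not the asker's actual data: the strings are made up, and it assumes a reasonably recent scikit-learn and NumPy. It shows a 1-D list of strings being accepted and a 2-D array triggering the AttributeError.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
docs = ["the quick brown fox", "the lazy dog"]

# 1-D input: every item is a string, so vectorizing works.
ok = TfidfVectorizer().fit_transform(docs)
print(ok.shape[0])  # one row per document

# 2-D input: every item is a row (an ndarray), which has no .lower() method.
two_d = np.array([["the quick brown fox", "a"], ["the lazy dog", "b"]])
try:
    TfidfVectorizer().fit_transform(two_d)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)
```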
So you have two choices: either concatenate the two columns into one column beforehand (in pandas), or, if you want to keep the two columns separate, use a feature union in the pipeline (http://scikit-learn.org/stable/modules/pipeline.html#feature-union).
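Both choices could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's own code: the toy DataFrame stands in for training_data, and select_column and column_tfidf are hypothetical helpers introduced here. FunctionTransformer is used only as one possible way to pick a column inside a FeatureUnion.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy stand-in for training_data (values are made up for illustration).
training_data = pd.DataFrame({
    "text": ["cheap pills now", "meeting at noon", "win money fast", "lunch tomorrow"],
    "path_contents": ["spam folder", "inbox work", "spam folder", "inbox personal"],
    "class_recoded": [1, 0, 1, 0],
})
y = training_data["class_recoded"].astype("int32")

# Choice 1: join the two columns into a single column of strings.
X_joined = training_data["text"] + " " + training_data["path_contents"]
joined_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
joined_pipeline.fit(X_joined, y)


# Choice 2: keep both columns and vectorize each one separately.
def select_column(frame, name):
    """Hypothetical helper: return one column as a 1-D Series of strings."""
    return frame[name]


def column_tfidf(name):
    """Hypothetical helper: select one column, then apply TF-IDF to it."""
    return Pipeline([
        ("select", FunctionTransformer(select_column, kw_args={"name": name})),
        ("tfidf", TfidfVectorizer()),
    ])


Xs = training_data[["text", "path_contents"]]
union_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text", column_tfidf("text")),
        ("path", column_tfidf("path_contents")),
    ])),
    ("nb", MultinomialNB()),
])
union_pipeline.fit(Xs, y)

print(joined_pipeline.predict(X_joined))
print(union_pipeline.predict(Xs))
```

With the second layout, grid-search parameter names follow the nesting of the steps, for example features__text__tfidf__max_df for the vectorizer applied to the text column.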
As for the first exception, my guess is that it comes from the way pandas and sklearn interact. It is hard to say for sure, though, because of the error in the code described above.
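One plausible reading of that first exception, offered as an assumption rather than something confirmed above: iterating over a DataFrame yields its column labels, so a vectorizer handed the two-column frame would see just two "documents", while the labels for a training fold number 1642. The sketch below uses a made-up frame to show the row count collapsing to the number of columns.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the two-column Xs (values are made up for illustration).
df = pd.DataFrame({
    "text": ["red apple", "green pear", "yellow banana"],
    "path_contents": ["fruit apple", "fruit pear", "fruit banana"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
labels = list(df)
print(labels)

# The vectorizer therefore treats the two labels as two "documents",
# so the output has as many rows as df has columns, not as df has rows.
matrix = TfidfVectorizer().fit_transform(df)
print(matrix.shape[0], "rows from a frame with", len(df), "rows")
```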