Sklearn: Text and numeric features with ColumnTransformer raise a ValueError

Question

I'm trying to build a pipeline with SKLearn 0.20.2 using the new ColumnTransformer feature. When I fit my classifier with clf.fit(x_train, y_train), I keep getting this error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

I have one column containing blocks of text, called text. All of my other columns are numeric. I'm using CountVectorizer in my pipeline, and I think that's where the trouble is. I would much appreciate a hand with this.

In case it helps, this is what x_train/y_train looks like after I run the pipeline (omitting the row numbers that normally show in the left column; the text column runs taller than is shown in the image).

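from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline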

# mapped to column names from dataframe
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# mapped to column names from dataframe
text_features = ['text']
text_transformer = Pipeline(steps=[
    ('vect', CountVectorizer())
])

preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, numeric_features),('text', text_transformer, text_features)]
)

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())
                     ])

x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)
clf.fit(x_train,y_train)

Recommended answer

Vadim is correct, as you can see if you run this code:

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = SimpleImputer(strategy='median')

num = numeric_transformer.fit_transform(df[numeric_features])

# num.shape  
# (3, 4)

text_features = ['text']
text_transformer = CountVectorizer()

text = text_transformer.fit_transform(df[text_features])

print(text_transformer.get_feature_names())
print(text.toarray())

The output will look like this:

['text']
[[1]]

This comes from a quirk of the text-processing step that I have run into more than once: CountVectorizer expects a 1-D sequence of documents, not a 2-D column selection.
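To see where the single 'text' feature comes from, it may help to look at what each selection yields when iterated, since CountVectorizer simply loops over whatever it is given. The sketch below uses a made-up three-row frame (the values are an assumption chosen to resemble the outputs in this answer, not the asker's data) and reads the vocabulary from the `vocabulary_` attribute so it does not depend on `get_feature_names()`, which newer scikit-learn releases replaced with `get_feature_names_out()`. The final `np.hstack` call only approximates the stacking that ColumnTransformer performs internally.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-row frame; the values are invented for illustration.
df = pd.DataFrame({
    "text": ["17569 8779", "16118 9480", "123 456"],
    "hasDate": [1, 0, 1],
})

# Iterating a DataFrame yields its column labels, not its rows.
docs_from_list = list(df[["text"]])
# Iterating a Series yields the cell values, one per row.
docs_from_string = list(df["text"])
print(docs_from_list)    # ['text']
print(docs_from_string)  # ['17569 8779', '16118 9480', '123 456']

# One-element list selector: the vectorizer sees a single "document".
bad = CountVectorizer().fit_transform(df[["text"]])

# String selector: the vectorizer sees one document per row.
good_vectorizer = CountVectorizer()
good = good_vectorizer.fit_transform(df["text"])
print(bad.shape, good.shape)  # (1, 1) (3, 6)
print(sorted(good_vectorizer.vocabulary_))

# A (1, 1) text block cannot sit next to a 3-row numeric block.
numeric_block = df[["hasDate"]].to_numpy()
stack_error = None
try:
    np.hstack([numeric_block, bad.toarray()])
except ValueError as exc:
    stack_error = exc
print(type(stack_error).__name__)  # ValueError
```

The mismatch in row counts (3 versus 1) is what surfaces as the concatenation error when the two blocks are combined.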

If you instead define text_features as a string rather than a one-element list:

text_features = 'text'
text_transformer = CountVectorizer()

text = text_transformer.fit_transform(df[text_features])

print(text_transformer.get_feature_names())
print(text.toarray())

The output becomes this:

['123', '16118', '17569', '456', '8779', '9480']
[[0 0 1 0 1 0]
[0 1 0 0 0 1]
[1 0 0 1 0 0]]

This is what you want.

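Passing the column name as a list selects a 2-D DataFrame rather than a 1-D Series. CountVectorizer iterates over its input, and iterating over a DataFrame yields its column labels, so the vectorizer sees only one item: the string 'text'.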
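Applied to the pipeline from the question, the same change would mean passing the text column to ColumnTransformer as the string 'text' while keeping the numeric columns as a list. The following is a minimal sketch under that assumption: the DataFrame contents and labels are invented for illustration, and the train/test split is left out to keep it short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical data standing in for the asker's DataFrame.
df = pd.DataFrame({
    "text": ["fix login page", "epic for billing", "update login copy", "billing epic split"],
    "hasDate": [1, 0, 1, 0],
    "iterationCount": [1.0, float("nan"), 3.0, 2.0],  # one missing value for the imputer
    "hasItemNumber": [0, 1, 0, 1],
    "isEpic": [0, 1, 0, 1],
})
labels = [0, 1, 0, 1]

# Numeric columns stay a list: SimpleImputer expects 2-D input.
numeric_features = ["hasDate", "iterationCount", "hasItemNumber", "isEpic"]
# The text column is a plain string: CountVectorizer expects 1-D input.
text_features = "text"

preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), numeric_features),
    ("text", CountVectorizer(), text_features),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", MultinomialNB()),
])
clf.fit(df, labels)

# Every row now contributes to both blocks, so they stack cleanly.
transformed = clf.named_steps["preprocessor"].transform(df)
vocab_size = len(clf.named_steps["preprocessor"].named_transformers_["text"].vocabulary_)
print(transformed.shape, vocab_size)
```

The transformed matrix has one row per input row and four numeric columns plus one column per vocabulary term.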
