FeatureUnion in scikit-learn and incompatible row dimension

Question

I have started using scikit-learn for text extraction. When I use the standard CountVectorizer and TfidfTransformer in a pipeline and try to combine them with new features (by concatenating matrices), I get a row dimension problem.

Here is my pipeline:

pipeline = Pipeline([('feats', FeatureUnion([
    ('ngram_tfidf', Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer())])),
    ('addned', AddNed()),])), ('clf', SGDClassifier()),])

This is my class AddNed, which adds 30 new features to each document (sample).

class AddNed(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, **transform_params):
        do_something
        x_new_feat = np.array(list_feat)
        print(type(X))
        X_np = np.array(X)
        print(X_np.shape, x_new_feat.shape)
        return np.concatenate((X_np, x_new_feat), axis = 1)

    def fit(self, X, y=None):
        return self

And here is the first part of my main program:

data = load_files('HO_without_tag')
grid_search = GridSearchCV(pipeline, parameters, n_jobs = 1, verbose = 20)
print(len(data.data), len(data.target))
grid_search.fit(X, Y).transform(X)

But I get this result:

486 486
Fitting 3 folds for each of 3456 candidates, totalling 10368 fits
[CV]feats__ngram_tfidf__vect__max_features=3000....
323
<class 'list'>
(323,) (486, 30)

And, of course, an IndexError exception:

return np.concatenate((X_np, x_new_feat), axis = 1)
IndexError: axis 1 out of bounds [0, 1

Why is the X passed to the transform function (in class AddNed) not a NumPy array of shape (486, 3000)? Its shape is only (323,). I don't understand, because if I remove the FeatureUnion and AddNed() from the pipeline, CountVectorizer and TfidfTransformer work properly, with the right features and the right shape. Does anyone have an idea? Thanks a lot.

Recommended answer

You've probably solved it by now, but someone else may have the same problem:

(323, 3000) # X shape Matrix
<class 'scipy.sparse.csr.csr_matrix'>

AddNed tries to concatenate a dense matrix with a sparse matrix; the sparse matrix should be converted to a dense matrix first. I ran into the same error when trying to use the result of CountVectorizer.
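The conversion described above can be sketched as follows. This is only a minimal illustration, not the original poster's code: the toy corpus `docs` and the two-column `extra` array are assumptions made up for the example, standing in for the real documents and the 30 added features. It shows two ways to join a sparse CountVectorizer result with a dense feature block: densifying with `toarray()` before `np.concatenate`, or keeping everything sparse with `scipy.sparse.hstack`.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption for illustration only).
docs = ["red apple", "green apple", "red car"]

# CountVectorizer returns a SciPy sparse matrix, one row per document.
X_sparse = CountVectorizer().fit_transform(docs)

# Hypothetical dense extra features: one row per document, two columns.
extra = np.arange(6, dtype=float).reshape(len(docs), 2)

# Option 1: convert the sparse matrix to a dense array, then concatenate.
dense = np.concatenate((X_sparse.toarray(), extra), axis=1)

# Option 2: stay sparse by wrapping the dense block and stacking columns.
combined = sparse.hstack([X_sparse, sparse.csr_matrix(extra)]).tocsr()

print(dense.shape, combined.shape)
```

Both options require every block to have the same number of rows. In the output shown in the question the two arrays have 323 and 486 rows, so converting to dense would probably not be enough on its own: the extra features would also need to be computed from the X that is passed to transform. With a large vocabulary, the sparse option avoids materializing a full dense matrix.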
