Vectorization: Not a valid collection


Problem description

I want to vectorize a txt file containing my training corpus for the OneClassSVM classifier. For that I'm using CountVectorizer from the scikit-learn library. Here is my code:

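import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

def file_to_corpse(file_name, stop_words):
    # Read the corpus, one document per line.
    with open(file_name) as fd:
        corp = fd.readlines()
    array_file = np.array(corp)
    # Extend the French stop-word list with the caller's extra words.
    stwf = stopwords.words('french')
    for w in stop_words:
        stwf.append(w)
    vectorizer = CountVectorizer(decode_error='replace', stop_words=stwf, min_df=1)
    # fit_transform returns a SciPy sparse matrix (documents x vocabulary).
    X = vectorizer.fit_transform(array_file)
    return X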

When I run this function on my file (around 206346 lines), I get the following error, which I can't make sense of:

Traceback (most recent call last):
  File "svm.py", line 93, in <module>
    clf_svm.fit(training_data)
  File "/home/imane/anaconda/lib/python2.7/site-packages/sklearn/svm/classes.py", line 1028, in fit
    super(OneClassSVM, self).fit(X, np.ones(_num_samples(X)), sample_weight=sample_weight,
  File "/home/imane/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 122, in _num_samples
    " a valid collection." % x)
TypeError: Singleton array array(<536172x13800 sparse matrix of type '<type 'numpy.int64'>'
    with 1952637 stored elements in Compressed Sparse Row format>, dtype=object) cannot be considered a valid collection.

Can somebody please help me solve this problem? I've been stuck on it for a while :).

Recommended answer

If you look at the scikit-learn source (the _num_samples function in sklearn/utils/validation.py, which appears in your traceback), you can see that it checks whether the following condition is true, x being your array:

if len(x.shape) == 0:

If so, it raises this exception:

TypeError("Singleton array %r cannot be considered a valid collection." % x)
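To see how a shape of length 0 can arise, here is a minimal sketch. It assumes, based on the dtype=object wrapper visible in the traceback, that the sparse matrix was passed through np.array() somewhere in code not shown in the question. The small matrix below is made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small sparse matrix standing in for the output of CountVectorizer.
X = csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))

# NumPy cannot unpack a SciPy sparse matrix, so np.array() stores the
# whole matrix as a single object inside a zero-dimensional array.
wrapped = np.array(X)

print(X.shape)        # (2, 3)
print(wrapped.shape)  # ()
print(wrapped.dtype)  # object

# The original matrix can be recovered from the 0-d wrapper with .item().
recovered = wrapped.item()
print(recovered is X)  # True
```

A zero-dimensional array has an empty shape tuple, which is exactly the len(x.shape) == 0 condition that triggers the exception.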

What I would suggest is that you check whether array_file, or the value returned by this function, has a shape of length > 0.
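As a rough sketch of that check, the snippet below vectorizes a tiny made-up corpus (a stand-in for the real training file), inspects the shape of the result, and passes the sparse matrix to OneClassSVM directly. It assumes the error came from wrapping the matrix in np.array() before calling fit, which the question's code does not show. The variable name clf_svm is taken from the traceback.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical three-line corpus standing in for the real training file.
corpus = ["le chat dort", "le chien mange", "un oiseau chante"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# The suggested check: a usable collection has a non-empty shape.
print(len(X.shape))  # 2

clf_svm = OneClassSVM()
# Pass the sparse matrix as-is; do not wrap it in np.array(X).
clf_svm.fit(X)
predictions = clf_svm.predict(X)
```

If the shape check prints 0 at the point where fit is called, the object being passed is a wrapper rather than the matrix itself.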
