Using scikit-learn to train on multidimensional data

Question

It's a very basic concept: I have more than one input field per training sample. My data is all text, and each sample has three separate fields. Every example I have been able to find has text data set up like this:

data = ['text1','text2',...]

Mine looks like this:

data = [['text1','text2','text3'],[...],...]

But when I try to fit to the data, I get the following traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-25-e3356a0f62f8> in <module>()
----> 1 classifier.fit(X,y)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.pyc in fit(self, X, y, sample_weight)
    140                              "by not using the ``sparse`` parameter")
    141 
--> 142         X = atleast2d_or_csr(X, dtype=np.float64, order='C')
    143 
    144         if self.impl in ['c_svc', 'nu_svc']:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.pyc in atleast2d_or_csr(X, dtype, order, copy)
    114     """
    115     return _atleast2d_or_sparse(X, dtype, order, copy, sparse.csr_matrix,
--> 116                                 "tocsr")
    117 
    118 

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _atleast2d_or_sparse(X, dtype, order, copy, sparse_class, convmethod)
     94         _assert_all_finite(X.data)
     95     else:
---> 96         X = array2d(X, dtype=dtype, order=order, copy=copy)
     97         _assert_all_finite(X)
     98     return X

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.pyc in array2d(X, dtype, order, copy)
     78         raise TypeError('A sparse matrix was passed, but dense data '
     79                         'is required. Use X.toarray() to convert to dense.')
---> 80     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
     81     _assert_all_finite(X_2d)
     82     if X is X_2d and copy:

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    318 
    319     """
--> 320     return array(a, dtype, copy=False, order=order)
    321 
    322 def asanyarray(a, dtype=None, order=None):

ValueError: setting an array element with a sequence.

Is there a specific way I have to approach this? Thank you!

Note:

All of the text data I am using is vectorized by a HashingVectorizer.

I call clf.fit(X,y), where X is a list of lists that each contain 3 vectorized texts, and y is a list of the respective categories that each element of X belongs to.

Recommended answer

X has to be a 2-dimensional array (or a list of lists, if you prefer). Each list in this list of lists has to contain only numeric values, and all of these lists must have the same length. Like this: [[1,2,3,5],[3,4,5,6],[6,7,8,9],...]. Your X is instead a list of lists whose elements are themselves vectors, which is why NumPy fails with "setting an array element with a sequence."

If you have several text entries per object and you vectorize each of them, you need to combine the resulting vectors into a single list, for example by concatenating them, if that makes sense in your context. In the end, each object has to be represented by a single list in which all entries are numeric. All objects must be represented by lists of equal length, and corresponding elements in all the lists must represent the same feature (e.g. the frequency of the same token in your texts). Let me know whether what I'm saying makes sense.
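The concatenation described above could look roughly like the following sketch. It is not the asker's actual code: the toy data, the `n_features` value, and the choice of `LinearSVC` are assumptions made for illustration. Each of the three fields is vectorized separately with a `HashingVectorizer`, and the three sparse blocks are joined column-wise with `scipy.sparse.hstack`, so every sample ends up as one numeric row of fixed length.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data: each sample has three separate text fields.
data = [
    ["red apple", "sweet fruit", "grows on trees"],
    ["green pear", "juicy fruit", "grows on trees"],
    ["steel hammer", "heavy tool", "sold in hardware stores"],
    ["iron wrench", "metal tool", "sold in hardware stores"],
]
y = ["fruit", "fruit", "tool", "tool"]

# Assumed feature count per field; pick a size that suits your vocabulary.
n_features = 2 ** 10
vectorizer = HashingVectorizer(n_features=n_features)

# Transpose the list of lists into three columns, one per field.
fields = list(zip(*data))

# HashingVectorizer is stateless, so transform() works without fit().
# Each block has shape (n_samples, n_features).
blocks = [vectorizer.transform(field) for field in fields]

# Join the blocks side by side: one row per sample, 3 * n_features columns.
X = hstack(blocks).tocsr()

classifier = LinearSVC()
classifier.fit(X, y)
```

Because each field gets its own block of columns, the same token appearing in different fields is treated as a different feature. If the fields should share features instead, joining the three strings of each sample into one string before vectorizing is an alternative.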
