Reshape data for Sklearn


Question

I have a list of colors:

import numpy as np

# 29 color names, one per sample
initialColors = np.array([
    u'black', u'black', u'black', u'white', u'white', u'white', u'powderblue',
    u'whitesmoke', u'black', u'cornflowerblue', u'powderblue', u'powderblue',
    u'goldenrod', u'white', u'lavender', u'white', u'powderblue', u'powderblue',
    u'powderblue', u'powderblue', u'powderblue', u'powderblue', u'powderblue',
    u'powderblue', u'white', u'white', u'powderblue', u'white', u'white'])

And I have labels for these colors like this:

# 29 labels, one per color
labels_train = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

0 means that a color was chosen by a female, 1 means it was chosen by a male. I am going to predict gender using another array of colors.

So, for my initial colors, I turn the names into numerical feature vectors like this:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(initialColors)
features_train = le.transform(initialColors)

After that, my features_train looks like this:

[0 0 0 5 5 5 4 6 0 1 4 4 2 5 3 5 4 4 4 4 4 4 4 4 5 5 4 5 5] 

Finally, I do this:

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(features_train, labels_train)

But I get an error:

/Library/Python/2.7/site-packages/sklearn/utils/validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
Traceback (most recent call last):
  File "app.py", line 36, in <module>
    clf.fit(features_train, labels_train)
  File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 182, in fit
    X, y = check_X_y(X, y)
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 531, in check_X_y
    check_consistent_length(X, y)
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 181, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 70]

So I did this:

features_train = features_train.reshape(-1, 1)
labels_train = labels_train.reshape(-1, 1)
clf.fit(features_train, labels_train)

And I got this:

/Library/Python/2.7/site-packages/sklearn/utils/validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

I also tried:

features_train = features_train.reshape(1, -1)
labels_train = labels_train.reshape(1, -1)

But I still get an error:

Traceback (most recent call last):
  File "app.py", line 36, in <module>
    clf.fit(features_train, labels_train)
  File "/Library/Python/2.7/site-packages/sklearn/naive_bayes.py", line 182, in fit
    X, y = check_X_y(X, y)
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 526, in check_X_y
    y = column_or_1d(y, warn=True)
  File "/Library/Python/2.7/site-packages/sklearn/utils/validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (1, 29)

My problem is that I don't understand the best way to reshape the data in my case. Can you please help me choose a way to reshape my data?

Recommended answer

Quick answer:

  • Do features_train = features_train.reshape(-1, 1);
  • Do NOT do labels_train = labels_train.reshape(-1, 1). Leave labels_train as it is.
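The two bullets above can be sketched end to end as follows. This is a minimal illustration, not the asker's actual script: the arrays `colors` and `new_colors` are shortened, made-up sample data, and the variable names `X`, `X_new`, and `predictions` are introduced here only for the example.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

# Shortened, made-up sample data (assumption: not the asker's full arrays).
colors = np.array(['black', 'black', 'white', 'powderblue', 'white', 'powderblue'])
labels_train = np.array([0, 0, 0, 1, 1, 1])      # 1D, shape (6,)

le = preprocessing.LabelEncoder()
features_train = le.fit_transform(colors)        # 1D, shape (6,)

# Reshape the features only: 6 samples, 1 feature each.
X = features_train.reshape(-1, 1)                # 2D, shape (6, 1)

clf = GaussianNB()
clf.fit(X, labels_train)                         # labels stay 1D

# New data must be encoded and reshaped the same way before predicting.
new_colors = np.array(['white', 'black'])
X_new = le.transform(new_colors).reshape(-1, 1)  # shape (2, 1)
predictions = clf.predict(X_new)                 # 1D, shape (2,)
print(predictions)
```

Note that the same fitted encoder is reused for the new colors, so each color name maps to the same integer it had during training.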

Some details:

It seems you are confused about why estimators require a 2D array as input. The training data X must have the shape (n_samples, n_features). So features_train.reshape(-1, 1) is correct in your case, since you have only 1 feature and want numpy to infer the number of samples. This does solve your first error.
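To see the shape inference described above, here is a small numpy-only sketch using the encoded values shown in the question. The names `X`, `n_samples`, and `n_features` are introduced just for this illustration.

```python
import numpy as np

# Encoded colors from the question: one value per sample, 29 samples.
features_train = np.array([0, 0, 0, 5, 5, 5, 4, 6, 0, 1, 4, 4, 2, 5, 3,
                           5, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 4, 5, 5])

# -1 tells numpy to infer that dimension; 1 fixes the number of columns.
X = features_train.reshape(-1, 1)
n_samples, n_features = X.shape

print(features_train.shape)  # (29,)
print(X.shape)               # (29, 1)
```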

fit() expects your target values y to be a 1D array of shape (n_samples,). When you do labels_train = labels_train.reshape(-1, 1), you convert it to a 2D column vector. That's why you got the second message. Note that it is a warning, not an error: fit() figured it out and did the correct conversion, so your program continues to run and should give the correct result.
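The warning itself suggests ravel(). As a small sketch of what that conversion does (the names `y_column` and `y_flat` are made up for this example), a column vector can be flattened back to the 1D shape:

```python
import numpy as np

# Labels from the question: 15 zeros followed by 14 ones, shape (29,).
labels_train = np.array([0] * 15 + [1] * 14)

y_column = labels_train.reshape(-1, 1)   # 2D column vector, shape (29, 1)
y_flat = y_column.ravel()                # back to 1D, shape (29,)

print(y_column.shape)  # (29, 1)
print(y_flat.shape)    # (29,)
```

The values are unchanged by either step; only the shape differs.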

When you do:

features_train = features_train.reshape(1, -1)
labels_train = labels_train.reshape(1, -1)

First, this is the wrong conversion for features_train in your case, because X.reshape(1, -1) means you have 1 sample and want numpy to infer the number of features. That is not what you want, but fit() doesn't know that and will treat the data as a single sample with 29 features, so the result will not be what you intended.
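A quick way to check which reshape is appropriate is to compare the first dimension of X (the number of samples) with the length of the labels. The sketch below uses placeholder feature values, since only the shapes matter here; `X_wrong`, `X_right`, and the two boolean names are hypothetical.

```python
import numpy as np

features_train = np.arange(29)                 # placeholder values, shape (29,)
labels_train = np.array([0] * 15 + [1] * 14)   # 29 labels

X_wrong = features_train.reshape(1, -1)   # shape (1, 29): 1 sample, 29 features
X_right = features_train.reshape(-1, 1)   # shape (29, 1): 29 samples, 1 feature

# The number of rows in X should equal the number of labels.
wrong_matches = X_wrong.shape[0] == labels_train.shape[0]
right_matches = X_right.shape[0] == labels_train.shape[0]

print(wrong_matches)  # False
print(right_matches)  # True
```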

That being said, your last error does not come from features_train = features_train.reshape(1, -1). It comes from labels_train = labels_train.reshape(1, -1). Your labels_train now has the shape (1, 29), a 2D row vector, which is neither a 1D array nor a column vector. Though we might know it should be interpreted as a 1D array of target values, fit() is not that smart yet and doesn't know what to do with it.
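The rule described above can be illustrated with a simplified, hypothetical helper. `to_1d_target` is not part of scikit-learn; it only mimics the behavior seen in the question's output (accept 1D, accept a column vector, reject anything else) and omits the warning that scikit-learn emits for column vectors.

```python
import numpy as np

def to_1d_target(y):
    """Hypothetical helper: return y as a 1D array, or raise if that is ambiguous."""
    y = np.asarray(y)
    if y.ndim == 1:
        return y                     # already (n_samples,)
    if y.ndim == 2 and y.shape[1] == 1:
        return y.ravel()             # column vector (n_samples, 1) is flattened
    raise ValueError("bad input shape {0}".format(y.shape))

labels = np.array([0] * 15 + [1] * 14)

ok_1d = to_1d_target(labels)                   # shape (29,)
ok_col = to_1d_target(labels.reshape(-1, 1))   # shape (29,)

try:
    to_1d_target(labels.reshape(1, -1))        # row vector, shape (1, 29)
    row_rejected = False
    message = ""
except ValueError as err:
    row_rejected = True
    message = str(err)

print(row_rejected, message)
```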
