sklearn dimensionality issues "Found array with dim 3. Estimator expected <= 2"


Problem description


I am trying to use KNN to correctly classify .wav files into two groups, group 0 and group 1.


I extracted the data, created the model, fit the model, however when I try and use the .predict() method I get the following error:

Traceback (most recent call last):
  File "/..../....../KNN.py", line 20, in <module>
    classifier.fit(X_train, y_train)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 761, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py", line 405, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.


I have found these two stackoverflow posts which describe similar issues:

sklearn Logistic Regression "ValueError: Found array with dim 3. Estimator expected <= 2."

Error: Found array with dim 3. Estimator expected <= 2


And, correct me if I'm wrong, but it appears that scikit-learn can only accept 2-dimensional data.


My training data has shape (3240, 20, 5255) Which consists of:

  • 3240 .wav files in this dataset (this is index 0 of the training data)
  • For each .wav file there is a (20, 5255) numpy array which represents the MFCC coefficients (MFCC coefficients try to represent the sound in a numeric way).


My target data has shape (3240,) # each category is 0 or 1


What code can I use to manipulate my training and testing data to convert it into a form that is usable by scikit-learn? Also, how can I ensure that data is not lost when I go down from 3 dimensions to 2 dimensions?

Recommended answer


It is true, sklearn works only with 2D data.

Things you could try:

  • Just use np.reshape on the training data to convert it to shape (3240, 20*5255). It will preserve all the original information. But sklearn will not be able to exploit the implicit structure in this data (e.g. that features 1, 21, 41, etc. are different versions of the same variable).
  • Build a convolutional neural network on your original data (e.g. with tensorflow+Keras stack). CNNs were designed specially to handle such multidimensional data and exploit its structure. But they have lots of hyperparameters to tune.
  • Use dimensionality reduction (e.g. PCA) on the data reshaped to (3240, 20*5255). It will try to preserve as much information as possible, while still keeping the number of features low.
  • Use manual feature engineering to extract specific information from the data structure (e.g. descriptive statistics along each dimension), and train your model on such features.
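The first option above (flattening with `np.reshape`) can be sketched as follows. The arrays here are small random stand-ins so the snippet runs quickly; the comments note the real shapes from the question, and `n_neighbors=3` is just an arbitrary starting value:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Small stand-in for the real data, which has shape (3240, 20, 5255):
# one (20, 5255)-like MFCC array per .wav file.
rng = np.random.default_rng(0)
X = rng.random((30, 20, 50))       # real data: (3240, 20, 5255)
y = rng.integers(0, 2, size=30)    # labels: 0 or 1

# Flatten each 2-D MFCC array into a single feature row.
# No information is dropped: every coefficient becomes one feature.
X_2d = X.reshape(X.shape[0], -1)   # real data becomes (3240, 20 * 5255)

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_2d, y)
predictions = classifier.predict(X_2d[:5])  # shape (5,), values in {0, 1}
```

The same `reshape` must be applied to any new samples before calling `predict`, so that train and test data share the 2-D layout.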


If you had more data (e.g. 100K examples), the first approach might work best. In your case (3K examples and ~100K features, since 20 * 5255 = 105,100) you need to regularize your model heavily to avoid overfitting.
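A minimal sketch of the PCA route mentioned above, again on small random stand-in data; `n_components=10` is a hypothetical starting point that would need tuning on the real data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Flattened stand-in for the real (3240, 20 * 5255) matrix.
rng = np.random.default_rng(0)
X = rng.random((60, 20, 50)).reshape(60, -1)  # shape (60, 1000)
y = rng.integers(0, 2, size=60)

# PCA compresses the very wide feature matrix to a handful of
# components before the (distance-based) KNN classifier sees it.
model = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
pca_predictions = model.predict(X[:5])  # shape (5,), values in {0, 1}
```

Fewer features also helps KNN specifically, since nearest-neighbor distances become less meaningful as the dimensionality grows.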

