SMOTE is giving array size / ValueError for all-categorical dataset

Question

I am using SMOTE-NC for oversampling my categorical data. I have only 1 feature and 10500 samples.

While running the below code, I am getting the error:

   ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-151-a261c423a6d8> in <module>()
     16 print(X_new.shape) # (10500, 1)
     17 print(X_new)
---> 18 sm.fit_sample(X_new, Y_new)

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
     81         )
     82 
---> 83         output = self._fit_resample(X, y)
     84 
     85         y_ = (label_binarize(output[1], np.unique(y))

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\imblearn\over_sampling\_smote.py in _fit_resample(self, X, y)
    926 
    927         X_continuous = X[:, self.continuous_features_]
--> 928         X_continuous = check_array(X_continuous, accept_sparse=["csr", "csc"])
    929         X_minority = _safe_indexing(
    930             X_continuous, np.flatnonzero(y == class_minority)

~\AppData\Local\Continuum\Miniconda3\envs\data-science\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    592                              " a minimum of %d is required%s."
    593                              % (n_features, array.shape, ensure_min_features,
--> 594                                 context))
    595 
    596     if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(10500, 0)) while a minimum of 1 is required.

Code:

import numpy as np
from imblearn.over_sampling import SMOTENC

sm = SMOTENC(random_state=27,categorical_features=[0,])

X_new = np.array(X_train.values.tolist())
Y_new = np.array(y_train.values.tolist())

print(X_new.shape) # (10500,)
print(Y_new.shape) # (10500,)

X_new = np.reshape(X_new, (-1, 1)) # SMOTE requires a 2-D array, hence changing the shape of X_new

print(X_new.shape) # (10500, 1)
print(X_new)
sm.fit_sample(X_new, Y_new)

If I understand correctly, the shape of X_new should be (n_samples, n_features), which is 10500 x 1. I am not sure why the ValueError reports it as shape=(10500, 0).

Can someone please help me here?

Recommended answer

I have reproduced your issue by adapting the example in the docs to data with a single categorical feature:

from collections import Counter
from numpy.random import RandomState
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTENC

X, y = make_classification(n_classes=2, class_sep=2,
 weights=[0.1, 0.9], n_informative=1, n_redundant=0, flip_y=0,
 n_features=1, n_clusters_per_class=1, n_samples=1000, random_state=10)

# simulate the only column to be a categorical feature
X[:, 0] = RandomState(10).randint(0, 4, size=(1000))
X.shape
# (1000, 1)

sm = SMOTENC(random_state=42, categorical_features=[0,]) # same behavior with categorical_features=[0]

X_res, y_res = sm.fit_resample(X, y)

This gives the same error:

ValueError: Found array with 0 feature(s) (shape=(1000, 0)) while a minimum of 1 is required.

The reason is actually quite simple, but you have to dig a little into the original SMOTE paper; quoting from the relevant section (emphasis mine):

While our SMOTE approach currently does not handle data sets with all nominal features, it was generalized to handle mixed datasets of continuous and nominal features. We call this approach Synthetic Minority Over-sampling TEchnique-Nominal Continuous [SMOTE-NC]. We tested this approach on the Adult dataset from the UCI repository. The SMOTE-NC algorithm is described below.

  1. Median computation: Compute the median of standard deviations of all continuous features for the minority class. If the nominal features differ between a sample and its potential nearest neighbors, then this median is included in the Euclidean distance computation. We use median to penalize the difference of nominal features by an amount that is related to the typical difference in continuous feature values.
  2. Nearest neighbor computation: Compute the Euclidean distance between the feature vector for which k-nearest neighbors are being identified (minority class sample) and the other feature vectors (minority class samples) using the continuous feature space. For every differing nominal feature between the considered feature vector and its potential nearest-neighbor, include the median of the standard deviations previously computed, in the Euclidean distance computation.
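The two quoted steps can be sketched in a few lines of NumPy. This is only an illustration of the quoted description, not imbalanced-learn's implementation: the helper names `median_of_stds` and `smote_nc_distance` and the toy data are made up, and the sketch assumes the penalty enters as the squared median under the square root, once per differing nominal feature.

```python
import numpy as np

def median_of_stds(X_cont_minority):
    """Step 1: median of the per-feature standard deviations (minority class only)."""
    return float(np.median(np.std(X_cont_minority, axis=0)))

def smote_nc_distance(a_cont, b_cont, a_nom, b_nom, med):
    """Step 2: Euclidean distance over the continuous features, with med**2 added
    under the square root once per nominal feature that differs (an assumption)."""
    squared = np.sum((a_cont - b_cont) ** 2)
    n_differing = int(np.sum(a_nom != b_nom))
    return float(np.sqrt(squared + n_differing * med ** 2))

# Toy minority-class samples (made up): two continuous and two nominal features.
X_cont = np.array([[0.0, 0.0], [2.0, 4.0]])
X_nom = np.array([["A", "X"], ["A", "Y"]])

med = median_of_stds(X_cont)  # per-feature stds are 1.0 and 2.0, so the median is 1.5
dist = smote_nc_distance(X_cont[0], X_cont[1], X_nom[0], X_nom[1], med)
print(med, dist)
```

Both steps operate on the continuous columns. If every feature is nominal, `X_cont` has zero columns and step 1 has nothing to take a median over.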

In other words, and although not stated explicitly, it is apparent that, in order for the algorithm to work, it needs at least one continuous feature. This is not the case here, so the algorithm rather unsurprisingly fails.

I guess that, internally, during step 1 (median computation), the algorithm temporarily removes all categorical features from the data; in doing so here, it is faced indeed with a shape of (1000, 0) (or (10500, 0) in your case), i.e. no data, hence the specific reference in the error message.
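The shape in the error message can be reproduced with NumPy alone. The sketch below assumes the continuous columns are selected as "all column indices except the categorical ones", which is consistent with the `X[:, self.continuous_features_]` line in the traceback; the variable names are illustrative.

```python
import numpy as np

n_samples, n_features = 1000, 1
categorical_features = [0]

# A single integer-coded categorical column, as in the reproduction above.
X = np.random.RandomState(10).randint(0, 4, size=(n_samples, n_features))

# Assumed selection rule: every column index that is not declared categorical.
continuous_features = np.setdiff1d(np.arange(n_features), categorical_features)
X_continuous = X[:, continuous_features]

print(continuous_features.size)  # 0
print(X_continuous.shape)        # (1000, 0)
```

An array with zero columns is what `check_array` rejects with "Found array with 0 feature(s)".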

So there is no actual programming issue to be remedied here; what you are trying to do is simply impossible with the SMOTE-NC algorithm (notice that the initials NC in the algorithm name stand for Nominal-Continuous).
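If the set of categorical columns is only known at run time, a small check before building the sampler can surface the problem with a clearer message. The helper below is hypothetical (it is not part of imbalanced-learn) and only a sketch:

```python
import numpy as np

def check_has_continuous(n_features, categorical_features):
    """Hypothetical guard: fail early if every column is declared categorical."""
    continuous = np.setdiff1d(np.arange(n_features), categorical_features)
    if continuous.size == 0:
        raise ValueError(
            "SMOTE-NC needs at least one continuous feature; "
            f"all {n_features} feature(s) are marked categorical."
        )
    return continuous

remaining = check_has_continuous(3, [0, 2])  # column 1 is left as continuous
print(remaining)

try:
    check_has_continuous(1, [0])  # the situation in the question
except ValueError as err:
    message = str(err)
    print(message)
```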
