Function for cross validation and oversampling (SMOTE)

Question

I wrote the code below. X is a DataFrame with shape (1000, 5) and y is a DataFrame with shape (1000, 1). y is the target to predict, and it is imbalanced. I want to apply cross validation and SMOTE.

def Learning(n, est, X, y):
    s_k_fold = StratifiedKFold(n_splits = n)
    acc_scores = []
    rec_scores = []
    f1_scores = []

    for train_index, test_index in s_k_fold.split(X, y): 
        X_train = X[train_index]
        y_train = y[train_index]    

        sm = SMOTE(random_state=42)
        X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

        X_test = X[test_index]
        y_test = y[test_index]

        est.fit(X_resampled, y_resampled)
        y_pred = est.predict(X_test)
        acc_scores.append(accuracy_score(y_test, y_pred))
        rec_scores.append(recall_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred)) 

    print('Accuracy:',np.mean(acc_scores))
    print('Recall:',np.mean(rec_scores))
    print('F1:',np.mean(f1_scores)) 

Learning(3, SGDClassifier(), X_train_s_pca, y_train)

When I run the code, I get the error below:


"None of [Int64Index([ 4231, 4235, 4246, 4250, 4255, 4295, 4317, 4344, 4381,\n 4387,\n ...\n 13122, 13123, 13124, 13125, 13126, 13127, 13128, 13129, 13130,\n 13131],\n dtype='int64', length=8754)] are in the [columns]"

Recommended answer

If you look carefully at the error stack trace (which is important, but which you did not include), you should see that the error comes from this line (and will come from the other similar lines):

X_train = X[train_index]

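Indexing with X[train_index] selects rows only for a NumPy array. On a pandas DataFrame the same expression looks up column labels, which is why the error mentions [columns]. Since you are using a pandas DataFrame, you should use loc (this matches the positional indices returned by split as long as the DataFrame has the default 0..n-1 index; otherwise use iloc):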

X_train = X.loc[train_index]
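
To make the difference between the three indexing styles concrete, here is a minimal sketch on a hypothetical toy DataFrame (the column names, values, and index labels are invented for illustration and are not from the question). It assumes an index that does not start at 0, as is common after a train/test split:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame whose index is NOT 0..n-1 (e.g. after train_test_split).
X = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]},
    index=[10, 11, 12, 13],
)
positions = np.array([0, 2])  # the kind of positional indices that split() yields

# Plain brackets treat the array as column labels, so this raises KeyError.
try:
    X[positions]
    bracket_error = None
except KeyError as exc:
    bracket_error = exc

# iloc selects rows by position, regardless of the index labels.
rows_by_position = X.iloc[positions]

# loc selects rows by label; labels 0 and 2 do not exist here, so it fails too.
try:
    X.loc[positions]
    loc_error = None
except KeyError as exc:
    loc_error = exc
```

With a default 0..n-1 index, loc and iloc would pick the same rows, which is why loc works in the common case.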

Alternatively, you can pass the underlying NumPy arrays instead:

Learning(3, SGDClassifier(), X_train_s_pca.values, y_train.values)
