Over-Sampling Class Imbalance Train/Test Split "Found input variables with inconsistent numbers of samples" Solution?

Views: 277  Posted: 2020/10/2 3:18:28  Tags: python, pandas, scikit-learn, classification, oversampling

Problem description

Trying to follow this article to perform over-sampling for imbalanced classification. My class ratio is about 8:1.

> https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook

I am confused about the pipeline + coding structure.

Should you over-sample after train/test splitting?

If so, how do you deal with the fact that the target label is dropped from X? I tried keeping it, performing the over-sampling, then dropping the label from X_train/X_test and using the new training set in my pipeline. However, I get the error "Found input variables with inconsistent numbers of samples", because the shapes no longer match: the over-sampled DataFrame has roughly doubled in size, with a 50/50 label distribution.

I understand why the error occurs, but how do I fix it when I want to over-sample to reduce class imbalance?
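
The shape mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the asker's code: the array sizes and the names `X_oversampled` and `y_original` are made up for illustration, and it assumes that scikit-learn's `check_consistent_length` utility performs the same kind of length check that runs when `fit` is called.

```python
import numpy as np
from sklearn.utils import check_consistent_length

# Hypothetical sizes: 8 feature rows after over-sampling,
# but only 5 labels left over from before over-sampling.
X_oversampled = np.zeros((8, 2))
y_original = np.zeros(5)

try:
    # Raises ValueError because the two inputs have different lengths.
    check_consistent_length(X_oversampled, y_original)
except ValueError as err:
    message = str(err)
    print(message)
```

The printed message reports both lengths, which makes it easy to see which input was resampled and which was not. The full code from the question follows.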


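    import pandas as pd
    from scipy.stats import randint as sp_randint  # assumed source of sp_randint
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder, StandardScaler

    # df, CAT_FEATURES, NUM_FEATURES, TEXT_FEATURES, SPLIT_PATTERN and stopwords
    # are not defined in this snippet.

    # The label column is deliberately kept in X so it is available for over-sampling.
    X = df
    #X = df.drop("label", axis=1)
    y = df["label"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=11, shuffle=True,
        stratify=y)  # stratify on the label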

    target_count = df.label.value_counts()  # sorted by count, largest class first
    print('Class 1:', target_count.iloc[0])
    print('Class 0:', target_count.iloc[1])
    print('Proportion:', round(target_count.iloc[0] / target_count.iloc[1], 2), ': 1')

    target_count.plot(kind='bar', title='Count (target)');

    # Class counts in the training split (largest class first)
    count_majority, count_minority = X_train.label.value_counts()

    # Divide by class ('1' is the larger class, per the counts printed below)
    df_majority = X_train[X_train['label'] == '1']
    df_minority = X_train[X_train['label'] == '0']

    # Sample the minority class with replacement up to the majority count
    df_minority_over = df_minority.sample(count_majority, replace=True)
    df_test_over = pd.concat([df_majority, df_minority_over], axis=0)

    print('Random over-sampling:')
    print(df_test_over.label.value_counts())

    # Output:
    # Random over-sampling:
    # 1    12682
    # 0    12682

    df_test_over.label.value_counts().plot(kind='bar', title='Count (target)')

    # drop label for new X_train and X_test
    X_train_OS = df_test_over.drop("label", axis=1)
    X_test = X_test.drop("label", axis=1)

    print(X_train_OS.shape)
    print(X_test.shape)

    print(y_train.shape)
    print(y_test.shape)

    # Output:
    # (25364, 9)   X_train_OS
    # (3552, 9)    X_test
    # (14207,)     y_train
    # (3552,)      y_test

    cat_transformer = Pipeline(steps=[
        ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

    num_transformer = Pipeline(steps=[
        ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
        ('num_scaler', StandardScaler())])

    text_transformer_0 = Pipeline(steps=[
        ('text_bow', CountVectorizer(lowercase=True,\
                                     token_pattern=SPLIT_PATTERN,\
                                     stop_words=stopwords))])
    # SelectKBest()
    # TruncatedSVD()

    text_transformer_1 = Pipeline(steps=[
        ('text_bow', CountVectorizer(lowercase=True,\
                                     token_pattern=SPLIT_PATTERN,\
                                     stop_words=stopwords))])
    # SelectKBest()
    # TruncatedSVD()

    FE = ColumnTransformer(
        transformers=[
            ('cat', cat_transformer, CAT_FEATURES),
            ('num', num_transformer, NUM_FEATURES),
            ('text0', text_transformer_0, TEXT_FEATURES[0]),
            ('text1', text_transformer_1, TEXT_FEATURES[1])])

    pipe = Pipeline(steps=[('feature_engineer', FE),
                         ("scales", MaxAbsScaler()),
                         ('rand_forest', RandomForestClassifier(n_jobs=-1, class_weight='balanced'))])

    random_grid = {"rand_forest__max_depth": [3, 10, 100, None],\
                  "rand_forest__n_estimators": sp_randint(10, 100),\
                  "rand_forest__max_features": ["auto", "sqrt", "log2", None],\
                  "rand_forest__bootstrap": [True, False],\
                  "rand_forest__criterion": ["gini", "entropy"]}

    strat_shuffle_fold = StratifiedKFold(n_splits=5,\
      random_state=123,\
      shuffle=True)

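    cv_train = RandomizedSearchCV(pipe, param_distributions=random_grid, cv=strat_shuffle_fold)
    # The ValueError is raised here: X_train_OS has 25364 rows after over-sampling,
    # while y_train still has the original 14207 labels.
    cv_train.fit(X_train_OS, y_train)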

    from sklearn.metrics import classification_report, confusion_matrix
    preds = cv_train.predict(X_test)
    print(confusion_matrix(y_test, preds))
    print(classification_report(y_test, preds))
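
One way to avoid the mismatch is to take both the features and the labels from the same over-sampled frame, so that they always have the same number of rows. The following is a minimal sketch on toy data rather than a definitive fix for the pipeline above: the `feature` column, the 90/10 class split, and the names `train_df`, `test_df`, `train_over` and `y_train_OS` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the asker's df: 90 rows of class '1', 10 rows of class '0'.
df = pd.DataFrame({
    "feature": range(100),
    "label": ["1"] * 90 + ["0"] * 10,
})

# Split first, keeping the label column so features and labels stay together.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=11, shuffle=True, stratify=df["label"]
)

# Over-sample the minority class in the training rows only.
counts = train_df["label"].value_counts()
majority_label, minority_label = counts.idxmax(), counts.idxmin()
df_majority = train_df[train_df["label"] == majority_label]
df_minority = train_df[train_df["label"] == minority_label]
df_minority_over = df_minority.sample(
    len(df_majority), replace=True, random_state=11
)
train_over = pd.concat([df_majority, df_minority_over], axis=0)

# Derive X and y from the same over-sampled frame so their lengths match.
X_train_OS = train_over.drop("label", axis=1)
y_train_OS = train_over["label"]

# The test set is left untouched, so it keeps the original class ratio.
X_test = test_df.drop("label", axis=1)
y_test = test_df["label"]

print(X_train_OS.shape, y_train_OS.shape)
print(y_train_OS.value_counts())
```

Applied to the code in the question, this would mean building a `y_train_OS` from `df_test_over["label"]` before dropping the column, and then calling `cv_train.fit(X_train_OS, y_train_OS)` instead of passing the original `y_train`.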
