Over-Sampling Class Imbalance Train/Test Split "Found input variables with inconsistent numbers of samples" Solution?


Problem description

I am trying to follow this article to perform over-sampling for imbalanced classification. My class ratio is about 8:1.

> https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook

I am confused about the pipeline + coding structure.


  • Should you over-sample after train/test splitting?
    • If so, how do you deal with the fact that the target label is dropped from X? I tried keeping the label in X, performing the over-sampling, then dropping the label from X_train/X_test and using the new training set in my pipeline. However, I get the error "Found input variables with inconsistent numbers of samples", because the shapes are inconsistent: the new over-sampled DataFrame is roughly doubled in size, with a 50/50 label distribution.

I understand why the error occurs, but how can it be avoided when I want to over-sample to reduce class imbalance?

      
          X = df
          #X = df.drop("label", axis=1)
          y = df["label"]
      
          X_train,\
          X_test,\
          y_train,\
          y_test = train_test_split(X,\
                                    y,\
                                    test_size=0.2,\
                                    random_state=11,\
                                    shuffle=True,\
                                    stratify=y)
      
          target_count = df.label.value_counts()
          print('Class 1:', target_count[0])
          print('Class 0:', target_count[1])
          print('Proportion:', round(target_count[0] / target_count[1], 2), ': 1')
      
          target_count.plot(kind='bar', title='Count (target)');
      
          # Class counts (value_counts sorts descending, so the majority comes first)
          count_majority, count_minority = X_train.label.value_counts()

          # Divide by class
          df_majority = X_train[X_train['label'] == '1']
          df_minority = X_train[X_train['label'] == '0']

          # Over-sample the minority class up to the majority count
          df_minority_over = df_minority.sample(count_majority, replace=True)
          df_test_over = pd.concat([df_majority, df_minority_over], axis=0)
      
          print('Random over-sampling:')
          print(df_test_over.label.value_counts())
      
          # Output:
          # Random over-sampling:
          # 1    12682
          # 0    12682
      
          df_test_over.label.value_counts().plot(kind='bar', title='Count (target)')
      
          # drop label for new X_train and X_test
          X_train_OS = df_test_over.drop("label", axis=1)
          X_test = X_test.drop("label", axis=1)
      
          print(X_train_OS.shape)
          print(X_test.shape)
      
          print(y_train.shape)
          print(y_test.shape)
      
          # Output:
          # (25364, 9)
          # (3552, 9)
          # (14207,)
          # (3552,)
      
          cat_transformer = Pipeline(steps=[
              ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
              ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])
      
          num_transformer = Pipeline(steps=[
              ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
              ('num_scaler', StandardScaler())])
      
          text_transformer_0 = Pipeline(steps=[
              ('text_bow', CountVectorizer(lowercase=True,\
                                           token_pattern=SPLIT_PATTERN,\
                                           stop_words=stopwords))])
          # SelectKBest()
          # TruncatedSVD()
      
          text_transformer_1 = Pipeline(steps=[
              ('text_bow', CountVectorizer(lowercase=True,\
                                           token_pattern=SPLIT_PATTERN,\
                                           stop_words=stopwords))])
          # SelectKBest()
          # TruncatedSVD()
      
          FE = ColumnTransformer(
              transformers=[
                  ('cat', cat_transformer, CAT_FEATURES),
                  ('num', num_transformer, NUM_FEATURES),
                  ('text0', text_transformer_0, TEXT_FEATURES[0]),
                  ('text1', text_transformer_1, TEXT_FEATURES[1])])
      
          pipe = Pipeline(steps=[('feature_engineer', FE),
                               ("scales", MaxAbsScaler()),
                               ('rand_forest', RandomForestClassifier(n_jobs=-1, class_weight='balanced'))])
      
          random_grid = {"rand_forest__max_depth": [3, 10, 100, None],\
                        "rand_forest__n_estimators": sp_randint(10, 100),\
                        "rand_forest__max_features": ["auto", "sqrt", "log2", None],\
                        "rand_forest__bootstrap": [True, False],\
                        "rand_forest__criterion": ["gini", "entropy"]}
      
          strat_shuffle_fold = StratifiedKFold(n_splits=5,\
            random_state=123,\
            shuffle=True)
      
          cv_train = RandomizedSearchCV(pipe, param_distributions=random_grid, cv=strat_shuffle_fold)
          cv_train.fit(X_train_OS, y_train)
      
          from sklearn.metrics import classification_report, confusion_matrix
          preds = cv_train.predict(X_test)
          print(confusion_matrix(y_test, preds))
          print(classification_report(y_test, preds))
      
      


      Recommended answer

      The problem you are having here is solved very easily (and arguably more elegantly) by SMOTE. It is easy to use and lets you keep the X_train, X_test, y_train, y_test syntax from train_test_split, because it performs the over-sampling on both X and y at the same time.

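      from imblearn.over_sampling import SMOTE
      from sklearn.model_selection import train_test_split

      # Split first, then over-sample only the training portion
      X_train, X_test, y_train, y_test = train_test_split(X, y)
      sm = SMOTE(random_state=42)
      # fit_resample returns X and y resampled together, so their lengths match
      X_resampled, y_resampled = sm.fit_resample(X_train, y_train)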
      
