Over-Sampling Class Imbalance Train/Test Split "Found input variables with inconsistent numbers of samples" Solution?
Problem Description
I am trying to follow this article to perform over-sampling for imbalanced classification. My class ratio is about 8:1.
> https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets/notebook
I am confused about the pipeline + coding structure.
- Should you over-sample after the train/test split?
- If so, how do you deal with the fact that the target label is dropped from X? I tried keeping the label, performing the over-sampling, then dropping the label from X_train/X_test and replacing the new training set in my pipeline. However, I get the error "Found input variables with inconsistent numbers of samples" because the shapes are inconsistent: the new over-sampled DataFrame is doubled in size, with a 50/50 label distribution.
I understand the problem, but how do you resolve it when you want to perform over-sampling to reduce class imbalance?
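To make the question concrete, here is a minimal sketch (with synthetic data; the `label` column name is taken from the question) of one way to over-sample after the split while keeping X and y aligned: resample the training rows with the label still attached, then split X and y off the same over-sampled frame so their row counts can never disagree.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 8:1 class ratio as in the question
df = pd.DataFrame({
    "feature": range(90),
    "label": [0] * 80 + [1] * 10,
})

# Split first, keeping the label attached so rows stay aligned
train, test = train_test_split(df, test_size=0.2, random_state=11,
                               shuffle=True, stratify=df["label"])

# Over-sample the minority class in the training set only
df_majority = train[train["label"] == 0]
df_minority = train[train["label"] == 1]
df_minority_over = df_minority.sample(len(df_majority), replace=True,
                                      random_state=11)
train_over = pd.concat([df_majority, df_minority_over])

# Re-derive X and y from the SAME over-sampled frame
X_train = train_over.drop("label", axis=1)
y_train = train_over["label"]
assert len(X_train) == len(y_train)
```

The key point is that `X_train` and `y_train` both come from `train_over`, so the over-sampling cannot leave them with mismatched lengths.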
```python
X = df  # X = df.drop("label", axis=1)
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=11,
                                                    shuffle=True,
                                                    stratify=y)

target_count = df.label.value_counts()
print('Class 1:', target_count[0])
print('Class 0:', target_count[1])
print('Proportion:', round(target_count[0] / target_count[1], 2), ': 1')
target_count.plot(kind='bar', title='Count (target)');

# Class count
count_class_0, count_class_1 = X_train.label.value_counts()

# Divide by class
df_class_0 = X_train[X_train['label'] == '1']
df_class_1 = X_train[X_train['label'] == '0']

df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)

print('Random over-sampling:')
print(df_test_over.label.value_counts())
```

```
Random over-sampling:
1    12682
0    12682
```

```python
df_test_over.label.value_counts().plot(kind='bar', title='Count (target)')

# drop label for new X_train and X_test
X_train_OS = df_test_over.drop("label", axis=1)
X_test = X_test.drop("label", axis=1)

print(X_train_OS.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

```
(25364, 9)
(3552, 9)
(14207,)
(3552,)
```

```python
cat_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('num_scaler', StandardScaler())])

text_transformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

text_transformer_1 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=SPLIT_PATTERN,
                                 stop_words=stopwords))])
# SelectKBest()
# TruncatedSVD()

FE = ColumnTransformer(
    transformers=[
        ('cat', cat_transformer, CAT_FEATURES),
        ('num', num_transformer, NUM_FEATURES),
        ('text0', text_transformer_0, TEXT_FEATURES[0]),
        ('text1', text_transformer_1, TEXT_FEATURES[1])])

pipe = Pipeline(steps=[('feature_engineer', FE),
                       ("scales", MaxAbsScaler()),
                       ('rand_forest', RandomForestClassifier(n_jobs=-1,
                                                              class_weight='balanced'))])

random_grid = {"rand_forest__max_depth": [3, 10, 100, None],
               "rand_forest__n_estimators": sp_randint(10, 100),
               "rand_forest__max_features": ["auto", "sqrt", "log2", None],
               "rand_forest__bootstrap": [True, False],
               "rand_forest__criterion": ["gini", "entropy"]}

strat_shuffle_fold = StratifiedKFold(n_splits=5,
                                     random_state=123,
                                     shuffle=True)

cv_train = RandomizedSearchCV(pipe, param_distributions=random_grid,
                              cv=strat_shuffle_fold)

# X_train_OS has 25364 rows, but y_train still has 14207 --
# this fit raises "Found input variables with inconsistent numbers of samples"
cv_train.fit(X_train_OS, y_train)

from sklearn.metrics import classification_report, confusion_matrix
preds = cv_train.predict(X_test)
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
```
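The shape printout makes the failure mode visible: the over-sampled `X_train_OS` has 25364 rows while `y_train` still has 14207. scikit-learn's input validation raises exactly the error from the title whenever X and y disagree in length, which can be reproduced in isolation (a sketch with dummy arrays matching those shapes):

```python
import numpy as np
from sklearn.utils.validation import check_X_y

X = np.zeros((25364, 9))  # over-sampled training features
y = np.zeros(14207)       # original, un-resampled labels

msg = ""
try:
    check_X_y(X, y)
except ValueError as err:
    # "Found input variables with inconsistent numbers of samples: ..."
    msg = str(err)

print(msg)
```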
Recommended Answer
The problem you are having here gets very easily (and arguably more elegantly) solved by SMOTE. It's easy to use and allows you to keep the `X_train, X_test, y_train, y_test` syntax from `train_test_split`, because it will perform the over-sampling on both X and y at the same time.

```python
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(X, y)
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
```