在imblearn管道中使用SMOTENC实现FAMD时发生AttributeError [英] AttributeError while implementing FAMD with SMOTENC in a imblearn pipeline
问题描述
我正在尝试使用FAMD,SMOTENC和其他预处理步骤来实现管道.但是每次都会出错.如果我从管道中删除FAMD,则效果很好.
I'm trying to implement a pipeline with FAMD, SMOTENC, and other preprocessing steps. However it gives error each time. If i remove FAMD from the pipeline it works fine.
我的代码:
#Seperate the dataset in two parts
num_df= X_train_new.select_dtypes(include=[np.number]).columns
cat_df= X_train_new.select_dtypes(exclude=[np.number]).columns
#Create a mask for categorical features
categorical_feature_mask = X_train_new.dtypes == object
print(categorical_feature_mask)
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
#Create a pipeline to automate the preprocessing steps and SMOTENC together
num_pipe = make_pipeline(SimpleImputer(strategy='median'))
cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
OneHotEncoder(handle_unknown='ignore'))
transformer= make_column_transformer((num_pipe, selector(dtype_include='number')),
(cat_pipe, selector(dtype_include='object')),n_jobs=2)
#Undersampling with SMOTENC
from imblearn.over_sampling import SMOTENC
smote= SMOTENC(categorical_features=categorical_feature_mask,random_state=99)
!pip install prince
from prince import FAMD
famd=FAMD(n_components=4,random_state=99)
from imblearn.pipeline import make_pipeline as imb_pipeline
#Fit the random forest learner
rf=RandomForestClassifier(n_estimators=300random_state=99)
pipe=imb_pipeline(transformer,smote,famd,rf)
pipe.fit(X_train_new,y_train_new)
print('Training Accuracy:%s'%pipe.score(X_train_new,y_train_new))
错误:
AttributeError Traceback (most recent call last)
<ipython-input-24-2b7ea084a318> in <module>()
3 rf=RandomForestClassifier(n_estimators=300,max_features=3,criterion='entropy',random_state=99)
4 pipe=imb_pipeline(transformer,smote,famd,rf)
----> 5 pipe.fit(X_train_new,y_train_new)
6 print('Training Accuracy:%s'%pipe.score(X_train_new,y_train_new))
6 frames
/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in fit(self, X, y, **fit_params)
235
236 """
--> 237 Xt, yt, fit_params = self._fit(X, y, **fit_params)
238 if self._final_estimator is not None:
239 self._final_estimator.fit(Xt, yt, **fit_params)
/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit(self, X, y, **fit_params)
195 Xt, fitted_transformer = fit_transform_one_cached(
196 cloned_transformer, None, Xt, yt,
--> 197 **fit_params_steps[name])
198 elif hasattr(cloned_transformer, "fit_resample"):
199 Xt, yt, fitted_transformer = fit_resample_one_cached(
/usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit_transform_one(transformer, weight, X, y, **fit_params)
564 def _fit_transform_one(transformer, weight, X, y, **fit_params):
565 if hasattr(transformer, 'fit_transform'):
--> 566 res = transformer.fit_transform(X, y, **fit_params)
567 else:
568 res = transformer.fit(X, y, **fit_params).transform(X)
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
572 else:
573 # fit method of arity 2 (supervised transformation)
--> 574 return self.fit(X, y, **fit_params).transform(X)
575
576
/usr/local/lib/python3.7/dist-packages/prince/famd.py in fit(self, X, y)
27
28 # Separate numerical columns from categorical columns
---> 29 num_cols = X.select_dtypes(np.number).columns.tolist()
30 cat_cols = list(set(X.columns) - set(num_cols))
31
/usr/local/lib/python3.7/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
689 return self.getnnz()
690 else:
--> 691 raise AttributeError(attr + " not found")
692
693 def transpose(self, axes=None, copy=False):
AttributeError: select_dtypes not found
推荐答案
tl; dr:尝试将 sparse = False
添加到您的 OneHotEncoder
中.考虑处理 prince
的问题,以处理稀疏输入.
tl;dr: try adding sparse=False
to your OneHotEncoder
. Consider raising an Issue with prince
, to handle sparse inputs.
从回溯中可以看到,问题是 FAMD.fit
尝试使用 X.select_dtypes
来分隔分类数据和数字数据. select_dtypes
是一个熊猫函数,因此通常我会假设 prince
被编写为对数据帧进行操作,而不是sklearn在内部使用的numpy数组(必要时从帧转换之后).但是,从源头来看,它们确实从那几行转换为numpy数组到dataframe.但是,最后一条跟踪消息来自scipy.这暗示您的 X
实际上可能是一个稀疏数组.实际上, OneHotEncoder
(在您的管道的前面)更喜欢输出稀疏数组,而 ColumnTransformer
根据其组成部分和参数确定是转换为稀疏还是密集sparse_threshold
.
You can see from the traceback that the problem is that FAMD.fit
tries X.select_dtypes
to separate categorical and numeric data. select_dtypes
is a pandas function, so normally I would assume that prince
is written to operate on dataframes and not the numpy arrays that sklearn uses internally (after converting from frames if necessary). However, looking at the source, a few lines above that one they do convert from numpy array to dataframe. But, the last trace message is from scipy. That hints that your X
may actually be a sparse array. And indeed OneHotEncoder
(earlier in your pipeline) prefers to output sparse arrays, and ColumnTransformer
determines whether to transform into sparse or dense depending on its component parts and the parameter sparse_threshold
.
这篇关于在imblearn管道中使用SMOTENC实现FAMD时发生AttributeError的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!