How to use sklearn Column Transformer?
Question
I'm trying to convert a categorical value (in my case, the Country column) into an encoded value using LabelEncoder followed by OneHotEncoder, and I was able to convert it. However, I'm getting a warning along the lines of: OneHotEncoder 'categorical_features' keyword is deprecated, "use the ColumnTransformer instead." How can I use ColumnTransformer to achieve the same result?
Below is my input data set and the code I tried.
Input Data set
Country Age Salary
France 44 72000
Spain 27 48000
Germany 30 54000
Spain 38 61000
Germany 40 67000
France 35 58000
Spain 26 52000
France 48 79000
Germany 50 83000
France 37 67000
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# x is my dataset variable name
label_encoder = LabelEncoder()
x.iloc[:,0] = label_encoder.fit_transform(x.iloc[:,0]) #LabelEncoder is used to encode the country value
hot_encoder = OneHotEncoder(categorical_features = [0])
x = hot_encoder.fit_transform(x).toarray()
This is the output I'm getting. How can I get the same output with ColumnTransformer?
0(fran) 1(ger) 2(spain) 3(age) 4(salary)
1 0 0 44 72000
0 0 1 27 48000
0 1 0 30 54000
0 0 1 38 61000
0 1 0 40 67000
1 0 0 35 58000
0 0 1 36 52000
1 0 0 48 79000
0 1 0 50 83000
1 0 0 37 67000
I tried the following code:
from sklearn.compose import ColumnTransformer, make_column_transformer
preprocess = make_column_transformer(
( [0], OneHotEncoder())
)
x = preprocess.fit_transform(x).toarray()
I was able to encode the Country column with the above code, but the Age and Salary columns are missing from the x variable after transforming.
Recommended answer
It is a bit strange to one-hot encode continuous data such as Salary. It makes no sense unless you have binned the salary into certain ranges/categories. If I were you, I would do:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
From here you can pipe it with a classifier, e.g.:
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])
Use it like this:
clf.fit(X_train,y_train)
This applies the preprocessor and then passes the transformed data to the predictor.
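To make the flow concrete, here is a minimal end-to-end sketch on the sample data from the question. Two things in it are assumptions and not part of the original post: the target `y` is a hypothetical binary label invented only so that `fit` has something to learn, and Age is treated as a numeric feature alongside Salary, with Country as the only categorical column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample data from the question.
X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})
# Hypothetical binary target, invented for this sketch only.
y = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# Assumption: Age is scaled as a number, only Country is one-hot encoded.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["Age", "Salary"]),
    ("cat", categorical_transformer, ["Country"]),
])
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(solver="lbfgs")),
])

# fit() runs the preprocessor first, then trains the classifier on its output.
clf.fit(X, y)
predictions = clf.predict(X)

# Width of the matrix the classifier sees: 2 scaled + 3 one-hot columns.
n_features = clf.named_steps["preprocessor"].transform(X).shape[1]
```

Because the preprocessing lives inside the pipeline, `predict` applies the same fitted imputers, scaler, and encoder to new rows automatically.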
If we want to select columns by data type on the fly, we can modify our preprocessor to use a column selector based on dtypes:
from sklearn.compose import make_column_selector as selector
# dtype_include is passed to pandas' select_dtypes: "number" matches all
# numeric columns; "category" only matches columns cast to category dtype.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include="number")),
        ('cat', categorical_transformer, selector(dtype_include="category"))])
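A selector is just a callable that takes a DataFrame and returns the matching column names, so it can be checked in isolation. The sketch below assumes a three-row excerpt of the question's data and uses the pandas dtype alias "number" for numeric columns. Note that string columns are not of "category" dtype by default, so Country is cast explicitly here; without the cast, the "category" selector would match nothing.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Three-row excerpt of the question's data, enough to show the dtypes.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
})
# String columns are not "category" dtype by default; cast explicitly.
df["Country"] = df["Country"].astype("category")

# Calling a selector on a DataFrame returns a list of column names.
numeric_columns = selector(dtype_include="number")(df)
categorical_columns = selector(dtype_include="category")(df)
```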
Using grid search:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'classifier__solver': ['lbfgs', 'sag'],  # step names are case-sensitive
}
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search
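If the goal is only to reproduce the layout asked for in the question (one-hot Country columns followed by the untouched Age and Salary), a minimal sketch is to keep the non-listed columns with `remainder="passthrough"`. The transformer name "country" and the variable `encoded` are illustrative choices, and `sparse_threshold=0` is used here so that the result is always a dense NumPy array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Sample data from the question.
x = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

preprocess = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",  # keep Age and Salary instead of dropping them
    sparse_threshold=0,       # always return a dense array
)
# Columns: France, Germany, Spain (sorted categories), then Age, Salary.
encoded = preprocess.fit_transform(x)
```

By default `remainder` is "drop", which is why Age and Salary disappeared in the attempt shown in the question. No LabelEncoder step is needed, since OneHotEncoder handles string categories directly.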