如何使用 sklearn 列转换器? [英] How to use sklearn Column Transformer?

查看:62
本文介绍了如何使用 sklearn 列转换器?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用 LabelEncoder 然后使用 OneHotEncoder 将分类值(在我的情况下是国家/地区列)转换为编码值,并且能够转换分类值.但是我收到警告,比如 OneHotEncoder 'categorical_features' 关键字已被弃用请改用 ColumnTransformer".那么我如何使用 ColumnTransformer 来实现相同的结果?

I'm trying to convert categorical value (in my case it is country column) into encoded value using LabelEncoder and then with OneHotEncoder and was able to convert the categorical value. But i'm getting warning like OneHotEncoder 'categorical_features' keyword is deprecated "use the ColumnTransformer instead." So how i can use ColumnTransformer to achieve same result ?

下面是我的输入数据集和我试过的代码

Below is my input data set and the code which i tried

Input Data set

Country Age Salary
France  44  72000
Spain   27  48000
Germany 30  54000
Spain   38  61000
Germany 40  67000
France  35  58000
Spain   26  52000
France  48  79000
Germany 50  83000
France  37  67000


import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#X is my dataset variable name

label_encoder = LabelEncoder()
x.iloc[:,0] = label_encoder.fit_transform(x.iloc[:,0]) #LabelEncoder is used to encode the country value
hot_encoder = OneHotEncoder(categorical_features = [0])
x = hot_encoder.fit_transform(x).toarray()

我得到的输出是,如何使用列转换器获得相同的输出

And the output i'm getting as, How can i get the same output with column transformer

0(fran) 1(ger) 2(spain) 3(age)  4(salary)
1         0       0      44        72000
0         0       1      27        48000
0         1       0      30        54000
0         0       1      38        61000
0         1       0      40        67000
1         0       0      35        58000
0         0       1      36        52000
1         0       0      48        79000
0         1       0      50        83000
1         0       0      37        67000

我尝试了以下代码

from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(

    ( [0], OneHotEncoder())
)
x = preprocess.fit_transform(x).toarray()

我能够使用上述代码对国家/地区列进行编码,但在转换后缺少 x 变量中的年龄和工资列

i was able to encode country column with the above code, but missing age and salary column from x varible after transforming

推荐答案

将连续数据编码为 Salary 有点奇怪.除非您将薪水划分为特定范围/类别,否则这毫无意义.如果我在你我会做的:

It is a bit strange to encode continuous data as Salary. It makes no sense unless you have binned your salary to certain ranges/categories. If I where you I would do:

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder



numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

从这里您可以使用分类器进行管道传输,例如

from here you can pipe it with a classifier e.g.

clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', LogisticRegression(solver='lbfgs'))])  
                  

这样使用:

clf.fit(X_train,y_train)

这将应用预处理器,然后将转换后的数据传递给预测器.

this will apply the preprocessor and then pass transfomed data to the predictor.

如果我们想动态选择数据类型,我们可以修改我们的预处理器以使用数据类型的列选择器:

If we want to select data types on fly, we can modify our preprocessor to use column selector by data dtypes:

from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_include="numeric")),
        ('cat', categorical_transformer, selector(dtype_include="category"))])

使用网格搜索


param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
    'Classifier__solver': ['lbfgs', 'sag'],
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search

这篇关于如何使用 sklearn 列转换器?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆