How to do one-hot encoding in several columns of a Pandas DataFrame for later use with Scikit-Learn

Question

Say I have the following data:

import pandas as pd
data = {
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
    'Mileage': [35000, 45000, 121000, 35000, 181000],
    'Year': [2015, 2014, 2012, 2016, 2013]
}
df = pd.DataFrame(data)

I would like to one-hot encode the two columns "Brand" and "Town" in order to train a classifier (say, with Scikit-Learn) and predict the year.

Once the classifier is trained, I will want to predict the year on new incoming data (not used in training), so I will need to re-apply the same one-hot encoding. For example:

new_data = {
    'Reference': [6, 7],
    'Brand': ['Volvo', 'Audi'],
    'Town': ['Stockholm', 'Munich']
}

In this context, what is the best way to one-hot encode the two columns of the Pandas DataFrame, given that several columns need to be encoded and that the same encoding must later be applied to new data?

This is related to the question "How to re-use LabelBinarizer for input prediction in SkLearn".

Recommended answer

Consider the following approach.

Demo:

from sklearn.preprocessing import LabelBinarizer
from collections import defaultdict

d = defaultdict(LabelBinarizer)

In [7]: cols2bnrz = ['Brand','Town']

In [8]: df[cols2bnrz].apply(lambda x: d[x.name].fit(x))
Out[8]:
Brand    LabelBinarizer(neg_label=0, pos_label=1, spars...
Town     LabelBinarizer(neg_label=0, pos_label=1, spars...
dtype: object

In [10]: new = pd.DataFrame({
    ...:     'Reference': [6, 7],
    ...:     'Brand': ['Volvo', 'Audi'],
    ...:     'Town': ['Stockholm', 'Munich']
    ...: })

In [11]: new
Out[11]:
   Brand  Reference       Town
0  Volvo          6  Stockholm
1   Audi          7     Munich

In [12]: pd.DataFrame(d['Brand'].transform(new['Brand']), columns=d['Brand'].classes_)
Out[12]:
   Audi  Volkswagen  Volvo
0     0           0      1
1     1           0      0

In [13]: pd.DataFrame(d['Town'].transform(new['Town']), columns=d['Town'].classes_)
Out[13]:
   Berlin  Munich  Stockholm
0       0       0          1
1       0       1          0
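
The demo above transforms one column at a time. As a minimal sketch of how the pieces might be combined into a single feature table, the script below fits one LabelBinarizer per column on the training data and reuses the fitted binarizers on the new data. The helper name `encode` and the `Column_value` naming of the output columns are assumptions made for illustration; they are not part of the original answer.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

train = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
    'Mileage': [35000, 45000, 121000, 35000, 181000],
    'Year': [2015, 2014, 2012, 2016, 2013],
})
new = pd.DataFrame({
    'Reference': [6, 7],
    'Brand': ['Volvo', 'Audi'],
    'Town': ['Stockholm', 'Munich'],
})

cols2bnrz = ['Brand', 'Town']

# One LabelBinarizer per column, fitted on the training data only.
d = defaultdict(LabelBinarizer)
for col in cols2bnrz:
    d[col].fit(train[col])


def encode(frame, binarizers, cols):
    """Hypothetical helper: replace `cols` with their one-hot columns.

    Assumes every encoded column had at least three distinct values when
    fitted; with exactly two classes LabelBinarizer returns a single
    column, so the column labels built below would not match.
    """
    parts = [frame.drop(columns=cols)]
    for col in cols:
        lb = binarizers[col]
        parts.append(pd.DataFrame(
            lb.transform(frame[col]),
            columns=[f'{col}_{c}' for c in lb.classes_],
            index=frame.index,
        ))
    return pd.concat(parts, axis=1)


train_encoded = encode(train, d, cols2bnrz)
new_encoded = encode(new, d, cols2bnrz)
print(new_encoded)
```

Because the binarizers are fitted once and then only used through `transform`, the new data gets the same columns in the same order as the training data, even when a category such as "Volkswagen" does not occur in it.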
