How to re-use LabelBinarizer for input prediction in Scikit-Learn

Question

I trained a classifier using Scikit-Learn. I load the training input for the classifier from a CSV. The values of some of my columns (e.g. 'Town') are categorical (e.g. 'New York', 'Paris', 'Stockholm', ...). To use those categorical columns, I one-hot encode them with the LabelBinarizer from Scikit-Learn.

This is how I transform the data before training:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

headers = [ 
    'Ref.', 'Town' #,...
]

df = pd.read_csv("/path/to/some.csv", header=None, names=headers, na_values="?")

lb = LabelBinarizer()
lb_results = lb.fit_transform(df['Town'])

However, it is not clear to me how to use the LabelBinarizer to create feature vectors from new input data for which I want to make predictions. In particular, if the new data contains a town that was already seen (e.g. New York), it needs to be encoded at the same position as that town in the training data.

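How should the label binarization be re-applied to new input data?

(If someone knows how to do this with Pandas' get_dummies method, that is fine too; I have no strong preference for Scikit-Learn.)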

Recommended answer

Just call lb.transform() on the already fitted lb model.
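Prediction often runs in a different process than training, so the fitted binarizer has to be carried over somehow. The following is a minimal sketch of one way to do that, assuming Python's standard pickle module is acceptable for your setup; the variable names (train_towns, blob, restored) are illustrative and not part of the original answer.

```python
import pickle

from sklearn.preprocessing import LabelBinarizer

# Training time: fit once on the training towns.
train_towns = ["New York", "Munich", "Kiev", "Paris",
               "Berlin", "New York", "Zaporizhzhia"]
lb = LabelBinarizer()
train_encoded = lb.fit_transform(train_towns)

# Persist the fitted binarizer (to bytes here; writing to a file works the same way).
blob = pickle.dumps(lb)

# Prediction time: restore it and call transform(), not fit_transform(),
# so the column layout learned during training is kept.
restored = pickle.loads(blob)
new_encoded = restored.transform(["Munich", "New York"])
```

Calling fit_transform() again on the new data would relearn the classes from that data alone, and the columns would no longer line up with the ones the classifier was trained on.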

Demo:

Assuming we have the following training DataFrame:

In [250]: df
Out[250]:
           Town
0      New York
1        Munich
2          Kiev
3         Paris
4        Berlin
5      New York
6  Zaporizhzhia

Fit (train) and transform (binarize) in one step:

In [251]: r1 = pd.DataFrame(lb.fit_transform(df['Town']), columns=lb.classes_)

Yields:

In [252]: r1
Out[252]:
   Berlin  Kiev  Munich  New York  Paris  Zaporizhzhia
0       0     0       0         1      0             0
1       0     0       1         0      0             0
2       0     1       0         0      0             0
3       0     0       0         0      1             0
4       1     0       0         0      0             0
5       0     0       0         1      0             0
6       0     0       0         0      0             1
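The column positions in this table come from lb.classes_, which holds the sorted unique labels seen during fitting. A small sketch of how to inspect that mapping follows; the column_of dictionary is a hypothetical helper added here for illustration.

```python
from sklearn.preprocessing import LabelBinarizer

towns = ["New York", "Munich", "Kiev", "Paris",
         "Berlin", "New York", "Zaporizhzhia"]
lb = LabelBinarizer().fit(towns)

# classes_ stores the sorted unique labels; index i is output column i.
classes = list(lb.classes_)

# Hypothetical helper: look up the column a given town is encoded in.
column_of = {town: i for i, town in enumerate(classes)}

print(classes)
print(column_of["New York"])
```

Because the mapping is stored on the fitted object, every later transform() call places a known town in the same column.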

lb is now fitted on the towns we had in df.

Now we can binarize new data sets using the fitted lb model (using lb.transform()):

In [253]: new
Out[253]:
       Town
0    Munich
1  New York
2     Dubai  # <--- new (not trained) town

In [254]: r2 = pd.DataFrame(lb.transform(new['Town']), columns=lb.classes_)

In [255]: r2
Out[255]:
   Berlin  Kiev  Munich  New York  Paris  Zaporizhzhia
0       0     0       1         0      0             0
1       0     0       0         1      0             0
2       0     0       0         0      0             0
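Note that the unseen town (Dubai) ends up as a row of zeros. The question also asked about get_dummies, so here is a hedged pandas-only sketch that should give the same layout: it freezes the category list from the training data and, as an extra step not in the original answer, flags unseen towns explicitly. The names town_dtype, r2_pandas and unseen are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"Town": ["New York", "Munich", "Kiev", "Paris",
                               "Berlin", "New York", "Zaporizhzhia"]})
new = pd.DataFrame({"Town": ["Munich", "New York", "Dubai"]})

# Freeze the category list from the training data (sorted, like lb.classes_).
town_dtype = pd.CategoricalDtype(categories=sorted(train["Town"].unique()))

# Towns outside the frozen list become NaN, so their rows are all zeros.
new_towns = new["Town"].astype(town_dtype)
r2_pandas = pd.get_dummies(new_towns, dtype=int)

# Flag unseen towns explicitly instead of relying on the all-zero row.
unseen = new.loc[~new["Town"].isin(town_dtype.categories), "Town"].tolist()

print(r2_pandas)
print(unseen)
```

Calling get_dummies directly on the new data without a fixed category list would produce only the columns present in that data, which is why the categories are pinned first.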
