Using Scikit's LabelEncoder correctly across multiple programs

Problem description

The basic task that I have at hand is:

a) Read some tab separated data.

b) Do some basic preprocessing.

c) For each categorical column, use LabelEncoder to create a mapping. This is done somewhat like this:

from sklearn import preprocessing

# Create one LabelEncoder per categorical column
mapper = {}
for x in categorical_list:
    mapper[x] = preprocessing.LabelEncoder()

# Fit each encoder on its column and replace the strings with integer codes
for x in categorical_list:
    df[x] = mapper[x].fit_transform(df[x])

where df is a pandas dataframe and categorical_list is a list of column headers that need to be transformed.

d) Train a classifier and save it to disk using pickle.

e) Now, in a different program, the saved model is loaded (see the sketch after this list).

f) The test data is loaded and the same preprocessing is performed.

g) The LabelEncoders are used for converting categorical data.

h) The model is used to predict.

Now the question that I have is: will step g) work correctly?

As the documentation for LabelEncoder says:

It can also be used to transform non-numerical labels (as long as 
they are hashable and comparable) to numerical labels.

So will each entry hash to the exact same value every time?

If not, what is a good way to go about this? Is there any way to retrieve the mappings of the encoder? Or is there an altogether different approach than LabelEncoder?

Recommended answer

According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if you fit the LabelEncoders at test time with data that have exactly the same set of unique values.
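
To see why that condition matters, here is a small made-up illustration: LabelEncoder assigns integer codes based on the sorted unique values it was fit on, so fitting on a test set that lacks even one training category shifts the codes.

from sklearn.preprocessing import LabelEncoder

train_values = ['amsterdam', 'paris', 'tokyo']
test_values = ['paris', 'tokyo']          # 'amsterdam' never appears in the test data

print(LabelEncoder().fit(train_values).transform(['paris']))  # [1]
print(LabelEncoder().fit(test_values).transform(['paris']))   # [0]  -> same label, different code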

There's a somewhat hacky way to reuse the LabelEncoders you got during training. LabelEncoder has only one property, namely classes_. You can save it and then restore it like this:

Training:

import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)

Test:

encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ is an object array (e.g. strings from a pandas column)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`

This seems more efficient than refitting it using the same data.
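
Applied to the question's mapper dictionary, the same idea would look roughly like this (a sketch only; the per-column file names and df_test are assumptions, not part of the original code):

import numpy
from sklearn.preprocessing import LabelEncoder

# Training program: persist the learned classes of every column's encoder
for col, enc in mapper.items():
    numpy.save('classes_%s.npy' % col, enc.classes_)

# Test program: rebuild each encoder and transform (not fit_transform) the test data
mapper = {}
for col in categorical_list:
    enc = LabelEncoder()
    enc.classes_ = numpy.load('classes_%s.npy' % col, allow_pickle=True)
    mapper[col] = enc
    df_test[col] = enc.transform(df_test[col])

Note that transform will raise a ValueError if the test data contains a category that was never seen during training, which is usually preferable to silently assigning it an inconsistent code.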
