How to pre-process new instances for classification, so that the feature encoding is the same as the model with Scikit-learn?

Question

I am building multi-class classification models on data that has 6 features. I pre-process the data with the code below, using LabelEncoder.

import numpy as np
from sklearn import preprocessing

# Both functions are methods of the class that holds the feature DataFrame self.X_df.

# Encodes the data for each column.
def pre_process_data(self):
    self.encode_column('feedback_rating')
    self.encode_column('location')
    self.encode_column('condition_id')
    self.encode_column('auction_length')
    self.encode_column('model')
    self.encode_column('gb')

# Gets the column by name, label-encodes its values and writes the result
# back to the column. Note: le is local, so the fitted encoder is lost
# once this method returns.
def encode_column(self, name):
    le = preprocessing.LabelEncoder()
    current_column = np.array(self.X_df[name]).tolist()
    self.X_df[name] = le.fit_transform(current_column)

When I want to predict a new instance I need to transform the data of the new instance so that the features match the same encoding as those in the model. Is there a simple way of achieving this?

Also if I want to persist the model and retrieve it, then is there a simple way of saving the encoding format, in order to use it to transform new instances on the retrieved model?

Recommended answer

When I want to predict a new instance I need to transform the data of the new instance so that the features match the same encoding as those in the model. Is there a simple way of achieving this?

I'm not entirely sure how your classification 'pipeline' operates, but you can simply call the transform method of your already fitted LabelEncoder on the new data. le will transform new data, provided every label in it also exists in the training set.

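from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# training data (NumPy coerces the mixed ints and strings to strings,
# so the classes are '0', '1', '2', '6', 'false', 'true')
train_x = [0,1,2,6,'true','false']
le.fit_transform(train_x)
# array([0, 1, 2, 3, 5, 4])

# transform some new data
new_x = [0,0,0,2,2,2,'false']
le.transform(new_x)
# array([0, 0, 0, 2, 2, 2, 4])

# transform data containing a label that was not seen during training
bad_x = [0,2,6,'new_word']
le.transform(bad_x)
# raises ValueError: 'new_word' was not in the training data
# (the exact message depends on the scikit-learn version)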
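
Applied to the question's setup, the same idea means keeping one fitted encoder per column instead of discarding it inside encode_column. The following is only a minimal sketch under assumptions: the DataFrame contents, the `encoders` dict and the `encode_instance` helper are hypothetical names introduced for illustration, and only three of the six columns are shown.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data; the values are made up for illustration.
train_df = pd.DataFrame({
    "location": ["UK", "US", "UK", "DE"],
    "model": ["5s", "6", "6", "5c"],
    "gb": ["16", "32", "64", "16"],
})

# Fit one encoder per column and keep it in a dict for later reuse.
encoders = {}
for name in train_df.columns:
    le = LabelEncoder()
    train_df[name] = le.fit_transform(train_df[name].tolist())
    encoders[name] = le

def encode_instance(instance, encoders):
    """Encode one new instance (dict of raw values) with the stored encoders."""
    return {name: int(encoders[name].transform([value])[0])
            for name, value in instance.items()}

new_instance = {"location": "UK", "model": "5c", "gb": "64"}
encoded = encode_instance(new_instance, encoders)
print(encoded)  # {'location': 1, 'model': 0, 'gb': 2}
```

Because each column is encoded with the encoder that was fitted on the training data, a new instance receives the same integer codes the model was trained on. A value that never appeared in training still raises a ValueError, as shown above.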

Also if I want to persist the model and retrieve it, then is there a simple way of saving the encoding format, in order to use it to transform new instances on the retrieved model?

You can save models/parts of your models like this:

import pickle  # Python 3; cPickle existed only in Python 2
import joblib  # standalone package; sklearn.externals.joblib was removed from recent scikit-learn releases
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
train_x = [0,1,2,6,'true','false']
le.fit_transform(train_x)

# Save your encoding
joblib.dump(le, '/path/to/save/model')
# OR
with open('/path/to/model', 'wb') as f:
    pickle.dump(le, f)

# Load those encodings
le = joblib.load('/path/to/save/model')
# OR
with open('/path/to/model', 'rb') as f:
    le = pickle.load(f)

# Then use as normal
new_x = [0,0,0,2,2,2,'false']
le.transform(new_x)
# array([0, 0, 0, 2, 2, 2, 4])
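
With several encoded columns, one option is to save all fitted encoders together in a single file. This is a rough sketch rather than a definitive recipe: the `encoders` dict, its column names and values, and the temporary file location are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Hypothetical per-column encoders, fitted on made-up training values.
encoders = {
    "location": LabelEncoder().fit(["UK", "US", "DE"]),
    "gb": LabelEncoder().fit(["16", "32", "64"]),
}

# Save every encoder in one file (a temporary directory, for illustration only).
path = os.path.join(tempfile.mkdtemp(), "encoders.joblib")
joblib.dump(encoders, path)

# Later, for example in another process: load them back and encode a new value.
restored = joblib.load(path)
code = int(restored["location"].transform(["US"])[0])
print(code)  # 2
```

The trained classifier could be stored in the same dict, so that the model and its encoders are always loaded as a matching pair. Files written this way are best loaded with the same scikit-learn version that produced them.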
