Load and predict new data with sklearn
Question
I trained a logistic regression model, cross-validated it, and saved it to a file using the joblib module. Now I want to load this model and predict new data with it. Is this the correct way to do it, especially the standardization? Should I call scaler.fit() on my new data too? In the tutorials I followed, scaler.fit was only used on the training set, so I'm a bit lost here.
Here is my code:
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the saved model with joblib
model = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]
# Standardize new data
scaler = StandardScaler()
X_pred = scaler.fit(pr[pred_cols]).transform(pr[pred_cols])
pred = pd.Series(model.predict(X_pred))
print(pred)
Recommended answer
No, this is incorrect. All data preparation steps should be fit on the training data only. Otherwise you risk applying the wrong transformation, because the means and variances that StandardScaler estimates will probably differ between the training data and the test (or new) data.
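To make the difference concrete, here is a minimal sketch on synthetic data. The arrays X_train and X_new and their distributions are assumptions made up for illustration, not the asker's data. It compares reusing the scaler fitted on the training set with refitting a new scaler on the data to be predicted.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): the new data is shifted
# relative to the training data.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(200, 1))
X_new = rng.normal(loc=14.0, scale=2.0, size=(50, 1))

# Correct: fit on the training data, then only transform the new data.
scaler = StandardScaler().fit(X_train)
X_new_ok = scaler.transform(X_new)

# Incorrect: refitting on the new data re-centres it around zero,
# which hides the shift the model would otherwise see.
X_new_refit = StandardScaler().fit_transform(X_new)

print("mean with training scaler:", X_new_ok.mean())
print("mean after refitting:     ", X_new_refit.mean())
```

With the training scaler the shift in the new data is preserved in the scaled values. After refitting, the new data is centred on zero, so a model trained on the original scaling receives inputs on a different scale from the one it learned.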
The easiest way to train, save, load and apply all the steps together is to use a Pipeline:
At training time:
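# prepare the pipeline
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# make_pipeline expects estimator instances, so LogisticRegression must be called
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# X_train, y_train: your training features and labels
pipe.fit(X_train, y_train)
# the fitted scaler and the fitted model are saved together
joblib.dump(pipe, 'model.pkl')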
At prediction time:
import joblib
import pandas as pd

# Load the saved pipeline with joblib
pipe = joblib.load('model.pkl')
# New data to predict
pr = pd.read_csv('set_to_predict.csv')
pred_cols = list(pr.columns.values)[:-1]
# apply the whole pipeline (scaling, then prediction) to the data
pred = pd.Series(pipe.predict(pr[pred_cols]))
print(pred)
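As a self-contained check of the approach above, the following sketch runs the full round trip on synthetic data. The arrays, the temporary file path and the labelling rule are assumptions for illustration only. It fits the pipeline, saves and reloads it, and confirms that the reloaded pipeline carries the scaler statistics learned from the training data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training and new data (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 2))
y_train = (X_train[:, 0] > 5.0).astype(int)
X_new = rng.normal(loc=5.0, scale=3.0, size=(10, 2))

# Fit scaler and classifier together, then save them as one object.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(pipe, path)

# Reload and predict: no separate scaler.fit() on the new data is needed.
loaded = joblib.load(path)
train_mean = loaded.named_steps["standardscaler"].mean_
same = np.array_equal(pipe.predict(X_new), loaded.predict(X_new))
print("stored training means:", train_mean)
print("predictions match:", same)
```

make_pipeline names each step after its lowercased class name, which is why the scaler is reachable as named_steps["standardscaler"].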