How to properly pickle sklearn pipeline when using custom transformer


Question

I am trying to pickle a sklearn machine-learning model and load it in another project. The model is wrapped in a pipeline that does feature encoding, scaling, etc. The problem starts when I want to use self-written transformers in the pipeline for more advanced tasks.

Suppose I have two projects:

  • train_project: has the custom transformers in src.feature_extraction.transformers.py
  • use_project: has other things in src, or has no src directory at all

If I save the pipeline with joblib.dump() in "train_project" and then load it with joblib.load() in "use_project", the load cannot find "src.feature_extraction.transformers" and throws an exception:

ModuleNotFoundError: No module named 'src.feature_extraction'

I should also add that my intention from the beginning was to simplify usage of the model, so that a programmer can load it like any other model and pass very simple, human-readable features, while all the "magic" preprocessing of features for the actual model (e.g. gradient boosting) happens inside.

I thought of creating a /dependencies/xxx_model/ directory in the root of both projects and storing all the needed classes and functions there (copying the code from "train_project" to "use_project"), so that the structure of the projects is the same and the transformers can be loaded. I find this solution extremely inelegant, because it would force that structure on any project where the model is used.

I also thought of just recreating the pipeline and all the transformers inside "use_project" and somehow loading the fitted values of the transformers from "train_project".
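The second idea (recreate the transformer in the consuming project and move only its fitted values) can be sketched roughly as below. This is a minimal illustration, not a sklearn feature: the plain-Python `FilterOutBigValues` class and the `export_params` and `rebuild` helpers are hypothetical names made up for this example, and it assumes the fitted state is JSON-serializable.

```python
import json


class FilterOutBigValues:
    """Plain-Python stand-in for the custom transformer (hypothetical, no sklearn)."""

    def fit(self, values):
        # Remember the largest value seen during fitting
        self.biggest_value = max(values)
        return self

    def transform(self, values):
        # Keep only values at or below the fitted maximum
        return [v for v in values if v <= self.biggest_value]


def export_params(transformer):
    # vars() exposes the instance attributes that fit() stored
    return json.dumps(vars(transformer))


def rebuild(cls, serialized):
    # Create a fresh instance and restore the fitted attributes onto it
    obj = cls()
    obj.__dict__.update(json.loads(serialized))
    return obj


# "train_project" side: fit and export only the fitted state
fitted = FilterOutBigValues().fit([1, 5, 3])
serialized = export_params(fitted)

# "use_project" side: the class is defined locally, the state arrives as data
restored = rebuild(FilterOutBigValues, serialized)
filtered = restored.transform([2, 7, 5])
```

Note that both projects still need their own copy of the class definition with this approach; only the fitted values travel as data.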

The best possible solution would be for the dumped file to contain all the needed information and require no dependencies. I am honestly shocked that sklearn pipelines do not seem to offer that possibility: what is the point of fitting a pipeline if I cannot load the fitted object later? Yes, it would work if I used only sklearn classes and did not create custom ones, but the built-in ones do not have all the functionality I need.

Example code:

train_project

src.feature_extraction.transformers.py

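# TransformerMixin and BaseEstimator are defined in sklearn.base
from sklearn.base import BaseEstimator, TransformerMixin


class FilterOutBigValuesTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Remember the largest c1 value seen in the training data
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        # Keep only rows whose c1 is at or below the fitted maximum
        return X.loc[X.c1 <= self.biggest_value]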

train_project

main.py

import joblib  # sklearn.externals.joblib has been removed from recent sklearn releases
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from src.feature_extraction.transformers import FilterOutBigValuesTransformer

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])
X = load_some_pandas_dataframe()  # placeholder for the actual data loading
pipeline.fit(X)
joblib.dump(pipeline, 'path.x')

use_project

main.py

import joblib

# Fails with ModuleNotFoundError: src.feature_extraction is not importable in this project
pipeline = joblib.load('path.x')

The expected result is that the pipeline loads correctly and its transform method can be used.

The actual result is an exception when loading the file.

Recommended answer

I found a pretty straightforward solution. Assuming you are using Jupyter notebooks for training:

  1. Create a .py file where the custom transformer is defined and import it into the Jupyter notebook.

This is the file custom_transformer.py:

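# TransformerMixin and BaseEstimator are defined in sklearn.base
from sklearn.base import BaseEstimator, TransformerMixin


class FilterOutBigValuesTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Remember the largest c1 value seen in the training data
        self.biggest_value = X.c1.max()
        return self

    def transform(self, X):
        # Keep only rows whose c1 is at or below the fitted maximum
        return X.loc[X.c1 <= self.biggest_value]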

  2. Train your model, importing this class from the .py file, and save it using joblib.

import joblib
from custom_transformer import FilterOutBigValuesTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipeline = Pipeline([
    ('filter', FilterOutBigValuesTransformer()),
    ('encode', MinMaxScaler()),
])

X = load_some_pandas_dataframe()  # placeholder for the actual data loading
pipeline.fit(X)

joblib.dump(pipeline, 'pipeline.pkl')

  3. When loading the .pkl file in a different Python script, you will have to import the .py file for it to work:

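import joblib
# Caveat: pickle records the class by its import path at dump time. If the pipeline
# was dumped with the class imported as custom_transformer.FilterOutBigValuesTransformer,
# the module must also be importable under the name custom_transformer here.
from utils import custom_transformer # decided to save it in a utils directory

pipeline = joblib.load('pipeline.pkl')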
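Why the import is needed at all: joblib builds on pickle, and pickle stores a custom class by reference (module path plus class name) together with the instance's attributes, not the class's source code. The sketch below shows this with the standard library only. The module name `demo_custom_transformer` and the simplified `FilterOutBigValues` class are hypothetical stand-ins created at runtime for the demonstration.

```python
import pickle
import sys
import types

# Build a throwaway module at runtime to stand in for custom_transformer.py
# (hypothetical module name, used only for this demonstration).
module = types.ModuleType("demo_custom_transformer")
exec(
    "class FilterOutBigValues:\n"
    "    def fit(self, values):\n"
    "        self.biggest_value = max(values)\n"
    "        return self\n",
    module.__dict__,
)
sys.modules["demo_custom_transformer"] = module

fitted = module.FilterOutBigValues().fit([1, 5, 3])
payload = pickle.dumps(fitted)

# The payload names the module but does not contain the class's code
names_module = b"demo_custom_transformer" in payload
contains_code = b"def fit" in payload

# Simulate the consuming project, where the module is not importable
del sys.modules["demo_custom_transformer"]
try:
    pickle.loads(payload)
    load_failed = False
except ModuleNotFoundError:
    load_failed = True

# Once the module is importable again under the same name, loading works
sys.modules["demo_custom_transformer"] = module
restored = pickle.loads(payload)
```

The same reasoning suggests that the consuming project has to expose the class under the same module path that was recorded when the pipeline was dumped.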
