What are the pitfalls of using Dill to serialise scikit-learn/statsmodels models?

Problem Description

I need to serialise scikit-learn/statsmodels models such that all the dependencies (code + data) are packaged in an artefact and this artefact can be used to initialise the model and make predictions. Using the pickle module is not an option because this will only take care of the data dependency (the code will not be packaged). So, I have been conducting experiments with Dill. To make my question more precise, the following is an example where I build a model and persist it.

from sklearn import datasets
from sklearn import svm
from sklearn.preprocessing import Normalizer
import dill

digits = datasets.load_digits()
training_data_X = digits.data[:-5]
training_data_Y = digits.target[:-5]
test_data_X = digits.data[-5:]
test_data_Y = digits.target[-5:]

class Model:
    def __init__(self):
        self.normalizer = Normalizer()
        self.clf = svm.SVC(gamma=0.001, C=100.)

    def train(self, training_data_X, training_data_Y):
        # Fit the normaliser on the training data, then fit the classifier.
        normalised_training_data_X = self.normalizer.fit_transform(training_data_X)
        self.clf.fit(normalised_training_data_X, training_data_Y)

    def predict(self, test_data_X):
        # Reuse the already-fitted normaliser; do not refit on test data.
        return self.clf.predict(self.normalizer.transform(test_data_X))

model = Model()
model.train(training_data_X, training_data_Y)
print(model.predict(test_data_X))

# Pickled output is bytes, so open the file in binary mode.
with open("my_model.dill", "wb") as model_file:
    dill.dump(model, model_file)

Corresponding to this, here is how I initialise the persisted model (in a new session) and make a prediction. Note that this code does not explicitly initialise or have knowledge of the class Model.

import dill
from sklearn import datasets

digits = datasets.load_digits()
training_data_X = digits.data[:-5]
training_data_Y = digits.target[:-5]
test_data_X = digits.data[-5:]
test_data_Y = digits.target[-5:]

# Open in binary mode to match the binary dump above.
with open("my_model.dill", "rb") as model_file:
    model = dill.load(model_file)

print(model.predict(test_data_X))

Has anyone used Dill in this way? The idea is for a data scientist to extend a ModelWrapper class for each model they implement, and then to build infrastructure around this that persists the models, deploys them as services, and manages the entire lifecycle of each model.

import abc
import dill

class ModelWrapper(abc.ABC):
    def __init__(self, model):
        self.model = model

    @abc.abstractmethod
    def predict(self, input):
        """Subclasses implement model-specific prediction."""

    def dumps(self):
        # Serialise the whole wrapper (code + fitted model) with dill.
        return dill.dumps(self)

    def loads(self, model_string):
        self.model = dill.loads(model_string)
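
To make the intended workflow concrete, here is a minimal sketch of a concrete subclass and a round trip through dumps/loads. SVCModelWrapper is an illustrative name, not part of the original code; model and test_data_X are the objects from the first listing.

class SVCModelWrapper(ModelWrapper):
    def predict(self, input):
        return self.model.predict(input)

wrapper = SVCModelWrapper(model)     # wrap the fitted Model from above
payload = wrapper.dumps()            # bytes carrying both code and data

# Note the asymmetry in the class as written: dumps() pickles the whole
# wrapper, while loads() assigns whatever it unpickles to self.model.
restored = SVCModelWrapper(None)
restored.loads(payload)
print(restored.model.predict(test_data_X))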

Other than the security implications (arbitrary code execution) and the requirement that modules like scikit-learn have to be installed on the machine that's serving the model, are there any other pitfalls in this approach? Any comments or words of advice would be most helpful.

I think that YHat and Dato have taken a similar approach, but rolled out their own implementations of Dill for similar purposes.

Recommended Answer

I'm the dill author. dill was built to do exactly what you are doing: persist numerical fits within class instances for statistics, where these objects can then be distributed to different resources and run in an embarrassingly parallel fashion. So, the answer is yes -- I have run code like yours, using mystic and/or sklearn.

Note that many of the authors of sklearn use cloudpickle, not dill, to enable parallel computing on sklearn objects. dill can pickle more types of objects than cloudpickle; however, cloudpickle is slightly better (at the time of writing) at pickling objects that make references to the global dictionary as part of a closure -- by default, dill does this by reference, while cloudpickle physically stores the dependencies. However, dill has a "recurse" mode that acts like cloudpickle, so the difference when using this mode is minor. (To enable "recurse" mode, do dill.settings['recurse'] = True, or pass recurse=True as a flag to dill.dump.) Another minor difference is that cloudpickle contains special support for things like scikits.timeseries and PIL.Image, while dill does not.
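
For concreteness, a minimal sketch of the two ways to enable "recurse" mode described above (model is the object from the question's first listing):

import dill

# Option 1: enable recurse mode globally for this process.
dill.settings['recurse'] = True

# Option 2: enable it for a single call via the flag.
with open("my_model.dill", "wb") as f:
    dill.dump(model, f, recurse=True)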

On the plus side, dill does not pickle classes by reference, so by pickling a class instance, it serializes the class object itself -- which is a big advantage, as it serializes instances of derived classes of classifiers, models, etc. from sklearn in their exact state at the time of pickling… so if you make modifications to the class object, the instance still unpickles correctly. There are other advantages of dill over cloudpickle, aside from the broader range of objects (and typically a smaller pickle) -- however, I won't list them here. You asked for pitfalls, so differences are not pitfalls.
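
A quick way to observe this behaviour is the sketch below (Greeter is an illustrative name): a class defined in __main__ travels with the pickled instance, so a fresh interpreter can unpickle it without ever seeing the class definition.

import dill

class Greeter:                      # defined in __main__, not importable elsewhere
    def hello(self):
        return "hello"

payload = dill.dumps(Greeter())     # the class definition is serialised too

# In a fresh interpreter that has never defined Greeter:
#   import dill
#   dill.loads(payload).hello()     # -> 'hello'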

The major pitfalls:

  • You should have anything your classes refer to installed on the remote machine, just in case dill (or cloudpickle) pickles it by reference.

  • You should try to make your classes and class methods as self-contained as possible (e.g. don't refer to objects defined in the global scope from your classes); a short sketch of this follows the list.

  • sklearn objects can be big, so saving many of them to a single pickle is not always a good idea… you might want to use klepto, which has a dict interface to caching and archiving, and enables you to configure the archive interface to store each key-value pair individually (e.g. one entry per file).
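
As a sketch of the "self-contained" advice in the second bullet (all names here are illustrative):

SCALE = 2.0                          # module-level state: risky under pickling

class LeakyModel:
    def predict(self, x):
        return x * SCALE             # depends on the global scope at unpickling time

class SelfContainedModel:
    def __init__(self, scale=2.0):
        self.scale = scale           # everything predict() needs lives on the instance

    def predict(self, x):
        return x * self.scale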
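
And a minimal sketch of the klepto suggestion in the last bullet, assuming klepto's dir_archive (the archive name model_store and the key 'svc' are illustrative); a dir_archive keeps a dict-like interface but stores each key-value pair as its own file:

from klepto.archives import dir_archive

models = dir_archive('model_store', cached=True, serialized=True)
models['svc'] = model                # the fitted model from the first listing
models.dump()                        # write cached entries to disk, one file per key

# Later, possibly in another process:
models = dir_archive('model_store', cached=True, serialized=True)
models.load('svc')                   # pull that entry back into the in-memory cache
restored = models['svc']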
