What are the pitfalls of using Dill to serialise scikit-learn/statsmodels models?
Question
I need to serialise scikit-learn/statsmodels models such that all the dependencies (code + data) are packaged in an artefact, and this artefact can be used to initialise the model and make predictions. Using the pickle module is not an option because this will only take care of the data dependency (the code will not be packaged). So, I have been conducting experiments with Dill. To make my question more precise, the following is an example where I build a model and persist it.
```python
from sklearn import datasets
from sklearn import svm
from sklearn.preprocessing import Normalizer
import dill

digits = datasets.load_digits()
training_data_X = digits.data[:-5]
training_data_Y = digits.target[:-5]
test_data_X = digits.data[-5:]
test_data_Y = digits.target[-5:]

class Model:
    def __init__(self):
        self.normalizer = Normalizer()
        self.clf = svm.SVC(gamma=0.001, C=100.)

    def train(self, training_data_X, training_data_Y):
        # note: must go through self.normalizer, not a bare `normalizer`
        normalised_training_data_X = self.normalizer.fit_transform(training_data_X)
        self.clf.fit(normalised_training_data_X, training_data_Y)

    def predict(self, test_data_X):
        # transform (not fit_transform): the normalizer was fitted during training
        return self.clf.predict(self.normalizer.transform(test_data_X))

model = Model()
model.train(training_data_X, training_data_Y)
print(model.predict(test_data_X))

# dill pickles are binary, so open the file in 'wb' mode
with open("my_model.dill", "wb") as model_file:
    dill.dump(model, model_file)
```
Corresponding to this, here is how I initialise the persisted model (in a new session) and make a prediction. Note that this code does not explicitly initialise or have knowledge of the class Model.
```python
import dill
from sklearn import datasets

digits = datasets.load_digits()
training_data_X = digits.data[:-5]
training_data_Y = digits.target[:-5]
test_data_X = digits.data[-5:]
test_data_Y = digits.target[-5:]

# binary mode to match the 'wb' used when dumping
with open("my_model.dill", "rb") as model_file:
    model = dill.load(model_file)

print(model.predict(test_data_X))
```
Has anyone used Dill in this way? The idea is for a data scientist to extend a ModelWrapper class for each model they implement, and then to build infrastructure around this that persists the models, deploys them as services, and manages the entire lifecycle of the model.
```python
import abc
import dill

class ModelWrapper(metaclass=abc.ABCMeta):
    def __init__(self, model):
        self.model = model

    @abc.abstractmethod
    def predict(self, input):
        return

    def dumps(self):
        return dill.dumps(self)

    def loads(self, model_string):
        self.model = dill.loads(model_string)
Other than the security implications (arbitrary code execution) and the requirement that modules like scikit-learn will have to be installed on the machine that's serving the model, are there any other pitfalls in this approach? Any comments or words of advice would be most helpful.
I think that YHat and Dato have taken a similar approach, but rolled out their own implementations of Dill for similar purposes.
Answer
I'm the dill author. dill was built to do exactly what you are doing… (to persist numerical fits within class instances for statistics), where these objects can then be distributed to different resources and run in an embarrassingly parallel fashion. So, the answer is yes -- I have run code like yours, using mystic and/or sklearn.
Note that many of the authors of sklearn use cloudpickle for enabling parallel computing on sklearn objects, and not dill. dill can pickle more types of objects than cloudpickle; however, cloudpickle is slightly better (at this time of writing) at pickling objects that make references to the global dictionary as part of a closure -- by default, dill does this by reference, while cloudpickle physically stores the dependencies. However, dill has a "recurse" mode that acts like cloudpickle, so the difference when using this mode is minor. (To enable "recurse" mode, do dill.settings['recurse'] = True, or use recurse=True as a flag in dill.dump.) Another minor difference is that cloudpickle contains special support for things like scikits.timeseries and PIL.Image, while dill does not.
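To make the recurse setting concrete, here is a small sketch; the function referencing a module-level value is a deliberately simple stand-in for a closure over globals:

```python
import dill

OFFSET = 10  # module-level value referenced by the function below

def shifted(x):
    # refers to OFFSET from the enclosing module's global dictionary
    return x + OFFSET

# Per-call flag: recurse into referenced globals while pickling
payload = dill.dumps(shifted, recurse=True)

# Equivalent process-wide setting:
# dill.settings['recurse'] = True

restored = dill.loads(payload)
print(restored(5))  # 15
```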
On the plus side, dill does not pickle classes by reference, so by pickling a class instance it serializes the class object itself -- which is a big advantage, as it serializes instances of derived classes of classifiers, models, etc. from sklearn in their exact state at the time of pickling… so if you make modifications to the class object, the instance still unpickles correctly. There are other advantages of dill over cloudpickle, aside from the broader range of objects (and typically a smaller pickle) -- however, I won't list them here. You asked for pitfalls, so differences are not pitfalls.
The major pitfalls:
- You should have anything your classes refer to installed on the remote machine, just in case dill (or cloudpickle) pickles it by reference.
- You should try to make your classes and class methods as self-contained as possible (e.g. don't refer to objects defined in the global scope from your classes).
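As an illustration of that point, a sketch with invented names -- the fragile variant leans on module-level state that may be pickled by reference, while the self-contained variant carries its state in the instance:

```python
import dill

THRESHOLD = 0.5  # module-level state: a remote process must also define this

class Fragile:
    def decide(self, score):
        # refers to the global THRESHOLD from inside the class
        return score > THRESHOLD

class SelfContained:
    def __init__(self, threshold=0.5):
        self.threshold = threshold  # captured in the instance itself

    def decide(self, score):
        return score > self.threshold

clf = SelfContained(0.5)
restored = dill.loads(dill.dumps(clf))  # everything travels with the pickle
print(restored.decide(0.7))  # True
```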
- sklearn objects can be big, so saving many of them to a single pickle is not always a good idea… you might want to use klepto, which has a dict interface to caching and archiving, and enables you to configure the archive interface to store each key-value pair individually (e.g. one entry per file).
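klepto's directory archives are the full-featured version of this idea; as a rough, self-contained illustration of one-file-per-entry storage, here is a toy dict-like store built on plain dill (the class and directory names are invented for the example):

```python
import os
import dill

class FileStore:
    """Toy dict-like store: one dill file per key.

    Illustrative only -- klepto provides this pattern properly,
    with caching and configurable archive backends.
    """
    def __init__(self, dirname):
        self.dirname = dirname
        os.makedirs(dirname, exist_ok=True)

    def __setitem__(self, key, value):
        with open(os.path.join(self.dirname, key + ".dill"), "wb") as f:
            dill.dump(value, f)

    def __getitem__(self, key):
        with open(os.path.join(self.dirname, key + ".dill"), "rb") as f:
            return dill.load(f)

store = FileStore("model_store")
store["model_a"] = {"weights": [1, 2, 3]}  # stand-in for a fitted model
print(store["model_a"]["weights"])  # [1, 2, 3]
```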