What is the recommended way to distribute a scikit learn classifier in spark?


Problem Description

I have built a classifier using scikit-learn and now I would like to use Spark to run predict_proba on a large dataset. I currently pickle the classifier once using:

import pickle

with open('classifier.pickle', 'wb') as f:
    pickle.dump(clf, f)

Then, in my Spark code, I broadcast this pickle using sc.broadcast so it can be loaded on each cluster node.
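For illustration, a minimal sketch of this broadcast pattern. The names features_rdd and clf, and the assumption that each record is a numeric feature list, are placeholders, not details from the original post:

import pickle
import numpy as np
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Load the pickled classifier once on the driver and broadcast it;
# each executor then deserializes it once rather than once per task.
with open('classifier.pickle', 'rb') as f:
    clf_broadcast = sc.broadcast(pickle.load(f))

def predict_partition(rows):
    # rows is an iterator over one partition's feature vectors
    X = np.array(list(rows))
    if X.size == 0:
        return iter([])
    return iter(clf_broadcast.value.predict_proba(X).tolist())

# features_rdd: an RDD of numeric feature lists (assumed to exist)
probabilities = features_rdd.mapPartitions(predict_partition)

Scoring per partition rather than per record avoids paying the predict_proba call overhead once per row.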

This works but the pickle is large (about 0.5GB) and it seems very inefficient.

Is there a better way to do this?

Recommended Answer

This works but the pickle is large (about 0.5GB)

Note that the size of the forest will be O(M * N * log(N)), where M is the number of trees and N is the number of samples. (source)

Is there a better way to do this?

There are several options you can try to reduce the size of either your RandomForestClassifier model or the serialized file:

  • Reduce the size of the model by optimizing hyperparameters, in particular max_depth, max_leaf_nodes, and min_samples_split, as these parameters influence the size of the trees used in the ensemble; see the sketch below.
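As an illustration, a hedged sketch of constraining those hyperparameters at training time. The specific values are placeholders for tuning, not recommendations from the answer:

from sklearn.ensemble import RandomForestClassifier

# Capping tree growth keeps each tree, and therefore the serialized
# ensemble, smaller. The values below are illustrative only.
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=12,          # cap tree depth
    max_leaf_nodes=256,    # cap the number of leaves per tree
    min_samples_split=10,  # require more samples before splitting
)
clf.fit(X_train, y_train)  # X_train, y_train assumed to exist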

  • Zip the pickle, e.g. as follows. Note that there are several compression options and one might suit you better, so you'll need to experiment:

import gzip
import pickle

with gzip.open('classifier.pickle', 'wb') as f:
    pickle.dump(clf, f)

  • Use joblib instead of pickle; it compresses better and is also the recommended approach. The caveat here is that joblib will create multiple files in a directory, so you'll have to zip these up for transport; see the sketch below.
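A minimal sketch of the joblib route; the compress level is an assumption, and higher levels trade CPU time for a smaller file:

import joblib

# compress enables zlib compression of the dump; with compression on,
# recent joblib versions write a single file.
joblib.dump(clf, 'classifier.joblib', compress=3)

# Later, on the Spark driver, before broadcasting:
clf = joblib.load('classifier.joblib')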

  • Last but not least, you can also try reducing the size of the input by dimensionality reduction before you fit/predict with the RandomForestClassifier, as mentioned in the practical tips on decision trees; see the sketch below.
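As one concrete example of that tip, a hedged sketch using PCA in a pipeline; PCA and the component count are assumptions here, since the practical tips discuss dimensionality reduction in general:

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Fewer input dimensions generally mean shallower trees and a
# smaller serialized model.
model = make_pipeline(
    PCA(n_components=50),  # illustrative component count
    RandomForestClassifier(n_estimators=100),
)
model.fit(X_train, y_train)  # X_train, y_train assumed to exist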

    YMMV

