Integrating scikit-learn with pyspark

Problem description

I'm exploring pyspark and the possibilities of integrating scikit-learn with pyspark. I'd like to train a model on each partition using scikit-learn. That means, when my RDD is defined and gets distributed among different worker nodes, I'd like to use scikit-learn and train a model (let's say a simple k-means) on each partition that exists on each worker node. As scikit-learn algorithms take a Pandas dataframe, my initial idea was to call toPandas for each partition and then train my model. However, the toPandas function collects the DataFrame into the driver, and this is not something that I'm looking for. Is there any other way to achieve such a goal?
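For illustration, here is a minimal sketch of the kind of per-partition training described above, using RDD.mapPartitions so that each k-means model is fit locally on the worker holding that partition. The toy data, partition count, and number of clusters are assumptions made up for this example:

import numpy as np
from pyspark.sql import SparkSession
from sklearn.cluster import KMeans

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Toy data: an RDD of numeric feature vectors split into two partitions.
rdd = sc.parallelize([[float(i), float(i % 7)] for i in range(100)], 2)

def train_partition(rows):
    # Runs on the worker: materialize this partition's rows locally
    # and fit an independent k-means model on them.
    data = np.array(list(rows))
    if len(data) >= 2:
        model = KMeans(n_clusters=2, n_init=10).fit(data)
        yield model.cluster_centers_

# One set of cluster centers comes back per partition.
centers_per_partition = rdd.mapPartitions(train_partition).collect()
print(centers_per_partition)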

Answer

scikit-learn can't be fully integrated with Spark for now, and the reason is that scikit-learn's algorithms aren't implemented to be distributed; they work only on a single machine.

Nevertheless, you can find ready-to-use Spark/scikit-learn integration tools in spark-sklearn, which supports (for the moment) executing GridSearch on Spark for cross-validation.
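For reference, grid search with spark-sklearn looks roughly like the following, based on its README. An existing SparkContext sc is assumed, and the estimator and parameter grid are illustrative:

from sklearn import svm, datasets
from spark_sklearn import GridSearchCV  # drop-in replacement for sklearn's

iris = datasets.load_iris()
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

# spark-sklearn's GridSearchCV takes the SparkContext as its first argument
# and fans the parameter combinations out across the cluster.
clf = GridSearchCV(sc, svm.SVC(gamma='auto'), param_grid)
clf.fit(iris.data, iris.target)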

Edit

As of 2020, spark-sklearn is deprecated and joblib-spark is its recommended successor. Based on the documentation, you can easily distribute a cross-validation to a Spark cluster like this:

from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# Each of the 5 cross-validation fits is dispatched to the Spark cluster.
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)

A GridSearchCV can be distributed in the same way.
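For example, a minimal sketch using the same joblib Spark backend (the estimator and parameter grid below are illustrative assumptions):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV
from sklearn.utils import parallel_backend
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

iris = datasets.load_iris()
param_grid = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)

# The fits for all parameter/fold combinations run on the Spark cluster.
with parallel_backend('spark', n_jobs=3):
    search.fit(iris.data, iris.target)

print(search.best_params_)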
