How are tasks distributed within a Spark cluster?

Problem Description

So I have an input that consists of a dataset and several ML algorithms (with parameter tuning) using scikit-learn. I have tried quite a few approaches to execute this as efficiently as possible, but at this very moment I still don't have the proper infrastructure to assess my results. However, I lack some background in this area and I need help to get things cleared up.

Basically I want to know how the tasks are distributed in a way that exploits all the available resources as much as possible, and what is actually done implicitly (for instance by Spark) and what isn't.

Here is my scenario:

I need to train many different Decision Tree models (as many as there are combinations of the possible parameters), many different Random Forest models, and so on...

In one of my approaches, I have a list and each of its elements corresponds to one ML algorithm and its list of parameters.

spark.parallelize(algorithms).map(lambda algorithm: run_experiment(dataframe, algorithm))

In this function run_experiment I create a GridSearchCV for the corresponding ML algorithm with its parameter grid. I also set n_jobs=-1 in order to (try to) achieve maximum parallelism.
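
Since run_experiment isn't shown in the question, here is a minimal sketch of what it might look like; the algorithms list, the grids, and the (X, y) data are all illustrative assumptions, not the asker's actual code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# One element per ML algorithm, paired with its parameter grid.
algorithms = [
    (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}),
    (RandomForestClassifier(), {"n_estimators": [50, 100], "max_depth": [5, 10]}),
]

def run_experiment(data, algorithm):
    # data must be a plain in-memory (X, y) pair shipped with the task closure:
    # scikit-learn running inside an executor cannot read a distributed
    # Spark DataFrame.
    X, y = data
    estimator, param_grid = algorithm
    # n_jobs=-1 parallelizes the grid search across all cores of whichever
    # executor this task lands on.
    search = GridSearchCV(estimator, param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    return type(estimator).__name__, search.best_params_, search.best_score_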

In this context, on my Spark cluster with a few nodes, does it make sense that the execution would look somewhat like this, with each algorithm running on its own node?

Or could there be one Decision Tree model and one Random Forest model running on the same node? This is my first experience using a cluster environment, so I am a bit confused about how to expect things to work.

On the other hand, what exactly changes in terms of execution if, instead of the first approach with parallelize, I use a for loop to sequentially iterate through my list of algorithms and create the GridSearchCV using Databricks's spark-sklearn integration between Spark and scikit-learn? The way it's illustrated in the documentation, it seems to be something like this:
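
The snippet referenced here didn't survive extraction; judging from spark-sklearn's documented usage, the pattern is roughly the following (the loop and variable names are assumptions; passing the SparkContext is the documented difference from plain scikit-learn):

from spark_sklearn import GridSearchCV  # drop-in replacement that takes a SparkContext

# Sequential driver-side loop; each fit is distributed by spark-sklearn.
for estimator, param_grid in algorithms:
    search = GridSearchCV(sc, estimator, param_grid=param_grid)
    search.fit(X, y)  # the candidate models are trained as Spark tasks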

Finally, with regards to this second approach, if I use the same ML algorithms but with Spark MLlib instead of scikit-learn, would the whole parallelization/distribution be taken care of?

Sorry if most of this is a bit naive, but I really appreciate any answers or insights on this. I wanted to understand the basics before actually testing in the cluster and playing with task scheduling parameters.

I am not sure whether this question is more suitable here or over at the CS StackExchange.

Recommended Answer

spark.parallelize(algorithms).map(...)

From the ref, "The elements of the collection are copied to form a distributed dataset that can be operated on in parallel." That means that your algorithms are going to be scattered among your nodes. From there, every algorithm will execute.

Your scheme could be valid if the algorithms and their respective parameters were scattered that way, which I think is the case for you.

About using all your resources, Spark is very good at this. However, you need to check that the workload is balanced among your tasks (each task should do roughly the same amount of work) in order to get good performance.
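
For instance, one way to keep that balance (reusing the hypothetical names from the sketch above) is to give each algorithm its own partition, so that two expensive grid searches don't end up pinned inside the same task:

# numSlices controls how many partitions (and hence tasks) the list becomes;
# one per algorithm lets the scheduler place each heavy job independently.
rdd = sc.parallelize(algorithms, numSlices=len(algorithms))
results = rdd.map(lambda algorithm: run_experiment(data, algorithm)).collect()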

What changes if instead of the first approach with parallelize, I use a for loop?

Everything. Your dataset (algorithms in your case) is not an RDD, thus no parallel execution occurs.
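
Concretely, the loop variant stays entirely on the driver (same hypothetical names as above):

# Each grid search runs one after another on the driver machine; the cluster
# sits idle unless run_experiment itself hands work to Spark (as the
# spark-sklearn variant does).
results = [run_experiment(data, algorithm) for algorithm in algorithms]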

.. and also using Databricks's spark-sklearn integration between Spark and scikit-learn?

This article describes how Random Forests are implemented there:

" Spark的scikit-learn软件包提供了交叉验证算法的替代实现,该算法可在Spark集群上分配工作量.每个节点都使用scikit-learn库的本地副本运行训练算法,并报告最好的模型交还给大师."

"The scikit-learn package for Spark provides an alternative implementation of the cross-validation algorithm that distributes the workload on a Spark cluster. Each node runs the training algorithm using a local copy of the scikit-learn library, and reports the best model back to the master."

We can generalize this to all your algorithms, which makes your scheme reasonable.

Spark MLlib instead of scikit-learn, would the whole parallelization/distribution be taken care of?

Yes, it would. The idea behind both of these libraries is to take care of these things for us, so that our lives are easier.
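
As a minimal illustration of the MLlib route (the column names and the grid are assumptions), the pyspark.ml tuning API expresses the whole grid search as a single distributed job:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Both the model and the cross-validated grid search are MLlib's own, so
# Spark plans the distribution of the training work itself.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100])
        .addGrid(rf.maxDepth, [5, 10])
        .build())
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(),
                    numFolds=5)
model = cv.fit(train_df)  # train_df: a Spark DataFrame with label/features columns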

I would suggest asking one big question at a time, because the answer here is already quite broad, but I will try to keep it concise.
