How are tasks distributed within a Spark cluster?

Problem Description

So I have an input that consists of a dataset and several ML algorithms (with parameter tuning) using scikit-learn. I have tried quite a few approaches to execute this as efficiently as possible, but at this very moment I still don't have the proper infrastructure to assess my results. However, I lack some background in this area and I need help to get things cleared up.

Basically I want to know how the tasks are distributed in a way that exploits all the available resources as much as possible, and what is actually done implicitly (for instance by Spark) and what isn't.

Here is my scenario:

I need to train many different Decision Tree models (as many as there are combinations of the possible parameters), many different Random Forest models, and so on...

In one of my approaches, I have a list and each of its elements corresponds to one ML algorithm and its list of parameters.

spark.parallelize(algorithms).map(lambda algorithm: run_experiment(dataframe, algorithm))

In this function run_experiment I create a GridSearchCV for the corresponding ML algorithm with its parameter grid. I also set n_jobs=-1 in order to (try to) achieve maximum parallelism.
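
Since run_experiment isn't shown in the question, here is a minimal sketch of what it might look like; the algorithms list, the grids, and the (X, y) data are all illustrative assumptions, not the asker's actual code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# One element per ML algorithm, paired with its parameter grid.
algorithms = [
    (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}),
    (RandomForestClassifier(), {"n_estimators": [50, 100], "max_depth": [5, 10]}),
]

def run_experiment(data, algorithm):
    # data must be a plain in-memory (X, y) pair shipped with the task closure:
    # scikit-learn running inside an executor cannot read a distributed
    # Spark DataFrame.
    X, y = data
    estimator, param_grid = algorithm
    # n_jobs=-1 parallelizes the grid search across all cores of whichever
    # executor this task lands on.
    search = GridSearchCV(estimator, param_grid, cv=5, n_jobs=-1)
    search.fit(X, y)
    return type(estimator).__name__, search.best_params_, search.best_score_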

In this context, on my Spark cluster with a few nodes, does it make sense that the execution would look somewhat like this, with each algorithm running on its own node?

Or could there be one Decision Tree model and one Random Forest model running on the same node? This is my first experience using a cluster environment, so I am a bit confused about how to expect things to work.

On the other hand, what exactly changes in terms of execution if, instead of the first approach with parallelize, I use a for loop to sequentially iterate through my list of algorithms and create the GridSearchCV using Databricks's spark-sklearn integration between Spark and scikit-learn? The way it's illustrated in the documentation, it seems to be something like this:
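
The snippet referenced here didn't survive extraction; judging from spark-sklearn's documented usage, the pattern is roughly the following (the loop and variable names are assumptions; passing the SparkContext is the documented difference from plain scikit-learn):

from spark_sklearn import GridSearchCV  # drop-in replacement that takes a SparkContext

# Sequential driver-side loop; each fit is distributed by spark-sklearn.
for estimator, param_grid in algorithms:
    search = GridSearchCV(sc, estimator, param_grid=param_grid)
    search.fit(X, y)  # the candidate models are trained as Spark tasks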

Finally, with regards to this second approach, if I use the same ML algorithms but with Spark MLlib instead of scikit-learn, would the whole parallelization/distribution be taken care of?

Sorry if most of this is a bit naive, but I really appreciate any answers or insights on this. I wanted to understand the basics before actually testing in the cluster and playing with task scheduling parameters.

I am not sure whether this question is more suitable here or over at the CS StackExchange.

Recommended Answer

spark.parallelize(algorithms).map(...)

From the ref, "The elements of the collection are copied to form a distributed dataset that can be operated on in parallel." That means that your algorithms are going to be scattered among your nodes. From there, every algorithm will execute.

Your scheme could be valid if the algorithms and their respective parameters were scattered that way, which I think is the case for you.

About using all your resources, Spark is very good at this. However, you need to check that the workload is balanced among your tasks (each task should do roughly the same amount of work) in order to get good performance.
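
For instance, one way to keep that balance (reusing the hypothetical names from the sketch above) is to give each algorithm its own partition, so that two expensive grid searches don't end up pinned inside the same task:

# numSlices controls how many partitions (and hence tasks) the list becomes;
# one per algorithm lets the scheduler place each heavy job independently.
rdd = sc.parallelize(algorithms, numSlices=len(algorithms))
results = rdd.map(lambda algorithm: run_experiment(data, algorithm)).collect()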

What changes if instead of the first approach with parallelize, I use a for loop?

Everything. Your dataset (algorithms in your case) is not an RDD, thus no parallel execution occurs.
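
Concretely, the loop variant stays entirely on the driver (same hypothetical names as above):

# Each grid search runs one after another on the driver machine; the cluster
# sits idle unless run_experiment itself hands work to Spark (as the
# spark-sklearn variant does).
results = [run_experiment(data, algorithm) for algorithm in algorithms]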

.. and also using Databricks's spark-sklearn integration between Spark and scikit-learn?

This article describes how Random Forests are implemented there:

" Spark的scikit-learn软件包提供了交叉验证算法的替代实现,该算法可在Spark集群上分配工作量.每个节点都使用scikit-learn库的本地副本运行训练算法,并报告最好的模型交还给大师."

"The scikit-learn package for Spark provides an alternative implementation of the cross-validation algorithm that distributes the workload on a Spark cluster. Each node runs the training algorithm using a local copy of the scikit-learn library, and reports the best model back to the master."

We can generalize this to all your algorithms, which makes your scheme reasonable.

Spark MLlib instead of scikit-learn, would the whole parallelization/distribution be taken care of?

Yes, it would. The idea behind both of these libraries is to take care of these things for us, so that our lives are easier.
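
As a minimal illustration of the MLlib route (the column names and the grid are assumptions), the pyspark.ml tuning API expresses the whole grid search as a single distributed job:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Both the model and the cross-validated grid search are MLlib's own, so
# Spark plans the distribution of the training work itself.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100])
        .addGrid(rf.maxDepth, [5, 10])
        .build())
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(),
                    numFolds=5)
model = cv.fit(train_df)  # train_df: a Spark DataFrame with label/features columns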

I would suggest asking one big question at a time, because the answer here is already quite broad, but I will try to keep it concise.
