When does an action not run on the driver in Apache Spark?


Question

I have just started with Spark and am struggling with the concept of tasks.

Can anyone please help me understand when an action (say, reduce) does not run in the driver program?

In the Spark tutorial:

"Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel."

I'm currently experimenting with an application that reads a directory of 'n' files and counts the number of words.

In the web UI, the number of tasks is equal to the number of files, and all the reduce functions are taking place on the driver node.

Can you please describe a scenario where the reduce function won't execute on the driver? Does a task always include "transformation + action", or only "transformation"?

Recommended answer

All actions are performed on the cluster, and the results of an action may end up on the driver (depending on the action).
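To illustrate that split, here is a plain-Python sketch (a simulation, not Spark API) of how a distributed `reduce` is typically evaluated: each task reduces its own partition on the cluster, and only the per-partition partial results are merged on the driver. The helper names `reduce_partition` and `simulated_reduce` and the sample numbers are hypothetical.

```python
from functools import reduce
from operator import add


def reduce_partition(func, partition):
    """Work done by one task on an executor: reduce a single partition."""
    return reduce(func, partition)


def simulated_reduce(func, partitions):
    """Mimic the two phases of a distributed reduce (hypothetical helper)."""
    # Phase 1 (cluster): one task per non-empty partition.
    partials = [reduce_partition(func, p) for p in partitions if p]
    # Phase 2 (driver): merge the small list of partial results.
    return reduce(func, partials)


# Made-up per-line word counts, one partition per input file.
partitions = [[3, 5, 2], [7, 1], [4]]
total = simulated_reduce(add, partitions)
print(total)
```

With a function that is not associative, such as subtraction, the result of this sketch depends on how the elements are split across partitions, which is why the tutorial asks for a commutative and associative function.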

Generally speaking, the Spark code you write around your business logic is not the program that actually runs. Rather, Spark uses it to create a plan that will execute your code on the cluster. The plan groups into a single task all the operations that can be done on a partition without shuffling data around. Every time Spark needs the data arranged differently (e.g. for sorting), it creates a new set of tasks (a new stage, in Spark's terminology) with a shuffle between the earlier tasks and the later ones.
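The split that a shuffle introduces can be sketched in plain Python (a simulation, not Spark API), using a word count as the example. The first group of tasks runs once per partition and pipelines the split and the local count without moving data; the shuffle regroups the partial counts by word; the second group of tasks finishes the aggregation. The helper names, the two reducers, and the sample lines are assumptions for illustration.

```python
from collections import Counter, defaultdict


def stage1_task(lines):
    """One task: split lines into words and count them within a partition."""
    return Counter(word for line in lines for word in line.split())


def shuffle(partials, num_reducers):
    """Regroup partial counts so every word lands in exactly one bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for word, count in partial.items():
            buckets[hash(word) % num_reducers][word].append(count)
    return buckets


def stage2_task(bucket):
    """One task after the shuffle: sum the partial counts of each word."""
    return {word: sum(counts) for word, counts in bucket.items()}


# Two partitions, e.g. one per input file (made-up content).
partitions = [["a b a", "c"], ["b b", "a c"]]

partials = [stage1_task(p) for p in partitions]   # before the shuffle
buckets = shuffle(partials, num_reducers=2)       # data is rearranged
word_counts = {}
for bucket in buckets:                            # after the shuffle
    word_counts.update(stage2_task(bucket))

print(sorted(word_counts.items()))
```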
