Differentiate driver code and worker code in Apache Spark


Question

In an Apache Spark program, how do we know which part of the code will execute in the driver program and which part will execute on the worker nodes?


Answer

It is actually pretty simple. Everything that happens inside the closure created by a transformation happens on a worker. It means that whatever is passed inside map(...), filter(...), mapPartitions(...), groupBy*(...), or aggregateBy*(...) is executed on the workers. This includes reading data from persistent storage or remote sources.
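One way to see this is a minimal, self-contained sketch (the app name and local[*] master are placeholders) that tags each element with the hostname of the JVM that evaluated the map closure. On a real cluster the reported hosts are the workers; in local mode the driver and executors share one JVM, so the names coincide:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder setup: app name and master are illustrative only.
val spark = SparkSession.builder().appName("closure-placement").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val driverHost = java.net.InetAddress.getLocalHost.getHostName  // evaluated on the driver

// The lambda passed to map(...) is serialized and shipped to the executors,
// so getHostName inside it reports the machine that processes each partition.
val hosts = sc
  .parallelize(1 to 4, numSlices = 4)
  .map(i => (i, java.net.InetAddress.getLocalHost.getHostName))
  .collect()  // action: results are brought back to the driver

hosts.foreach { case (i, h) => println(s"element $i mapped on $h (driver: $driverHost)") }
spark.stop()
```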

Actions like count, reduce(...), and fold(...) are usually executed on both the driver and the workers. The heavy lifting is performed in parallel by the workers, and some final steps, such as reducing the outputs received from the workers, are performed sequentially on the driver.
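A sketch of that split for a few actions, assuming the same local[*] setup as above (the inline results follow from the arithmetic over 1..100):

```scala
// reduce: per-partition sums run in parallel on the workers;
// the driver then merges the four partial results sequentially.
val total = sc.parallelize(1 to 100, numSlices = 4).reduce(_ + _)   // 5050

// fold follows the same pattern: the zero value is applied once per
// partition on the workers and once more on the driver when combining.
val folded = sc.parallelize(1 to 100, numSlices = 4).fold(0)(_ + _) // 5050

// count: per-partition counts on the workers, summed on the driver.
val n = sc.parallelize(1 to 100, numSlices = 4).count()             // 100

println(s"reduce=$total fold=$folded count=$n")
```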

Everything else, like triggering an action or defining a transformation, happens on the driver. In particular, this covers every operation that requires access to the SparkContext. In PySpark it also involves communication with the Py4j gateway.
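A sketch of that driver-side work, with a hypothetical input path data.txt; every line below runs on the driver until the final action ships the closures to the executors:

```scala
// Defining RDDs and chaining transformations only builds the lineage
// graph on the driver; no data is read or processed yet.
val rdd = sc.textFile("data.txt")  // hypothetical path; lazily evaluated
  .filter(_.nonEmpty)              // lambda is *defined* here, *executed* on workers
  .map(_.length)

// SparkContext calls like these are driver-only; referencing sc inside a
// transformation closure would fail, since the context is not serializable.
println(sc.applicationId)
sc.setJobDescription("compute total line length")

// Only the action below triggers a job and ships the closures to the executors.
val totalChars = rdd.sum()  // workers compute, driver receives the result
println(totalChars)
```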

