What are the benefits of running multiple Spark tasks in the same JVM?


Question


Different sources (e.g. 1 and 2) claim that Spark can benefit from running multiple tasks in the same JVM. But they don't explain why.

What are these benefits?

Answer

As was already said, broadcast variables are one thing.
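A minimal sketch of the broadcast-variable point, run in local mode so that driver and executors share one JVM (the object name `BroadcastSketch` and the lookup table are made up for illustration): a broadcast variable ships a read-only value to each executor once, and when the tasks run inside the same JVM it is effectively just a shared local reference, with no per-task serialization across processes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Labels each id via a broadcast lookup table. In local mode all tasks
  // run in the driver's JVM, so lookup.value is the same in-memory Map.
  def label(ids: Seq[Int]): Array[String] = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]"))
    try {
      val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
      sc.parallelize(ids).map(id => lookup.value.getOrElse(id, "?")).collect()
    } finally sc.stop()
  }
}
```

In a multi-JVM deployment the same code works, but each executor process must deserialize its own copy of the broadcast value once.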

Another is problems with concurrency. Take a look at this piece of code:

// counter lives in the driver's JVM
var counter = 0
val rdd = sc.parallelize(data)

// in cluster mode, each executor mutates its own serialized copy of counter
rdd.foreach(x => counter += x)

// prints 0 in cluster mode: the driver's counter was never updated
println(counter)

The result may differ depending on whether it is executed locally or on Spark deployed on a cluster (with different JVMs). In the latter case, the parallelize method splits the computation between the executors. The closure (the environment each node needs to do its task) is computed, which means every executor receives a copy of counter. Each executor sees its own copy of the variable, so the result of the calculation is 0, as none of the executors referenced the right object. Within one JVM, on the other hand, counter is visible to every worker.
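The "within one JVM" case can be illustrated without Spark at all (the object name `SharedCounter` is made up for this sketch): when the parallel "tasks" are just threads in one JVM, they all update the same counter object, so the driver thread sees the final sum.

```scala
import java.util.concurrent.atomic.AtomicInteger

object SharedCounter {
  // Sums data across several threads in one JVM. All threads share the
  // same counter object, so its updates are visible to the caller.
  // (AtomicInteger avoids the lost-update races a plain var would have.)
  def sum(data: Seq[Int], parallelism: Int = 4): Int = {
    val counter = new AtomicInteger(0)
    val chunkSize = math.max(1, data.length / parallelism)
    val threads = data.grouped(chunkSize).map { chunk =>
      new Thread(() => chunk.foreach(x => counter.addAndGet(x)))
    }.toList
    threads.foreach(_.start())
    threads.foreach(_.join())
    counter.get
  }
}
```

This is exactly what breaks once tasks move to separate executor JVMs: there is no shared heap, so each process increments its own copy.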

Of course there is a way to avoid that - using Accumulators (see here).
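A minimal sketch of the Accumulator fix, again in local mode (the object name `AccumulatorSketch` is made up for illustration): unlike the plain `counter` above, a LongAccumulator is merged back to the driver, so the sum comes out right whether the tasks share the driver's JVM or run in remote executor JVMs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  // Sums data with a driver-side accumulator instead of a captured var.
  def sum(data: Seq[Int]): Long = {
    val sc = new SparkContext(
      new SparkConf().setAppName("acc-sketch").setMaster("local[2]"))
    try {
      val counter = sc.longAccumulator("counter")
      // Each task calls add(); Spark merges the per-task totals back
      // into the driver's accumulator when the action completes.
      sc.parallelize(data).foreach(x => counter.add(x))
      counter.value
    } finally sc.stop()
  }
}
```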

Last but not least, when persisting RDDs in memory (the default storage level of the cache method is MEMORY_ONLY), the cached data is visible only within a single JVM. This can also be overcome by using OFF_HEAP (experimental as of 2.4.0). More here.

