Advantage of Broadcast Variables


Question

I am new to Spark and exploring its features. I am going through the Spark Programming Guide, and it says:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Considering the above, what are the advantages of using broadcast variables?

When we create a broadcast variable as shown below, is the variable reference (here, "broadcastVar") available on all the nodes in the cluster?

val broadcastVar = sc.broadcast(Array(1, 2, 3))

How long are these variables available in the memory of the nodes?

Recommended answer

If you have a huge array that is accessed from Spark closures, for example some reference data, this array will be shipped to each Spark node with the closure. For example, if you have a 10-node cluster with 100 partitions (10 partitions per node), this array will be distributed at least 100 times (10 times to each node).

If you use a broadcast variable, the array will be distributed once per node using an efficient peer-to-peer protocol.

val array: Array[Int] = ??? // some huge array
val broadcasted = sc.broadcast(array)

and some RDD

val rdd: RDD[Int] = ???

In this case the array will be shipped with the closure each time

rdd.map(i => array.contains(i))

and with a broadcast variable you'll get a huge performance benefit

rdd.map(i => broadcasted.value.contains(i))
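The fragments above can be put together into one program. What follows is only a minimal sketch, assuming spark-core is on the classpath and running in local mode. The object name BroadcastSketch, the placeholder data, and the local[4] master are illustrative assumptions, not part of the original answer. The last two calls touch on the question about lifetime: as far as the Broadcast API goes, the value stays usable until the application releases it with unpersist() or destroy().

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 4 worker threads, for illustration only.
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // Placeholder standing in for the "huge" reference array.
      val array: Array[Int] = (0 until 100000 by 2).toArray
      val rdd: RDD[Int] = sc.parallelize(1 to 1000, numSlices = 100)

      // Closure capture: `array` is serialized together with the closure.
      val viaClosure = rdd.map(i => array.contains(i)).filter(identity).count()

      // Broadcast: the closure captures only the small Broadcast handle;
      // each executor fetches the value itself and caches it.
      val broadcasted: Broadcast[Array[Int]] = sc.broadcast(array)
      val viaBroadcast =
        rdd.map(i => broadcasted.value.contains(i)).filter(identity).count()

      // Both approaches compute the same result; only the shipping differs.
      println(s"closure: $viaClosure, broadcast: $viaBroadcast")

      // unpersist() drops the cached copies on the executors
      // (the value is re-sent if the variable is used again).
      broadcasted.unpersist()
      // destroy() releases everything; the variable cannot be used afterwards.
      broadcasted.destroy()
    } finally {
      sc.stop()
    }
  }
}
```

On a single machine the two variants will not differ much in speed. The benefit described above shows up on a real cluster, where a large value would otherwise travel over the network with the closure.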
