What are broadcast variables? What problems do they solve?


Question

I am going through the Spark Programming Guide, which says:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Considering the above, what are the use cases of broadcast variables? What problems do broadcast variables solve?

When we create a broadcast variable as shown below, is the variable reference (here, broadcastVar) available on all the nodes in the cluster?

val broadcastVar = sc.broadcast(Array(1, 2, 3))

How long are these variables available in the memory of the nodes?

Recommended answer

If you have a huge array that is accessed from Spark closures (for example, some reference data), the array is shipped to each Spark node together with the closure. The closure is serialized with every task, and there is one task per partition. So if you have a 10-node cluster with 100 partitions (10 partitions per node), the array is distributed at least 100 times (10 times to each node).

If you use a broadcast variable, the array is distributed only once per node, using an efficient P2P protocol.
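
To put rough numbers on the example above, here is a minimal sketch of the arithmetic in plain Scala (no Spark required). The object and method names are hypothetical, and the 100 MB payload size is an assumed figure that does not come from the question.

```scala
// Back-of-the-envelope sketch; names and the payload size are illustrative assumptions.
object ShippingCost {
  // Captured in a closure: the array is serialized with every task, one task per partition.
  def copiesWithClosure(partitions: Int): Int = partitions

  // Broadcast: the array is delivered once per node.
  def copiesWithBroadcast(nodes: Int): Int = nodes

  // Total payload bytes moved for a given number of copies.
  def bytesShipped(copies: Int, payloadBytes: Long): Long = copies.toLong * payloadBytes

  def main(args: Array[String]): Unit = {
    val payload = 100L * 1024 * 1024 // assumed 100 MB array
    val closureBytes = bytesShipped(copiesWithClosure(partitions = 100), payload)
    val broadcastBytes = bytesShipped(copiesWithBroadcast(nodes = 10), payload)
    println(s"closure:   ${closureBytes / (1024 * 1024)} MB")
    println(s"broadcast: ${broadcastBytes / (1024 * 1024)} MB")
  }
}
```

Under these assumptions, the closure version moves 10,000 MB and the broadcast version 1,000 MB.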

val array: Array[Int] = ??? // some huge array
val broadcasted = sc.broadcast(array)

and some RDD:

val rdd: RDD[Int] = ???

In this case, the array is shipped with the closure each time:

rdd.map(i => array.contains(i))

With a broadcast variable, you get a huge performance benefit:

rdd.map(i => broadcasted.value.contains(i))
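
The fragments above can be combined into one runnable program. This is a minimal sketch, assuming spark-core is on the classpath and a local master is acceptable. The object name, the small stand-in array, and the partition count are illustrative choices that do not come from the original answer. It also touches on the lifetime question: the `Broadcast` API offers `unpersist()` and `destroy()` for releasing the data explicitly.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper object; a small array stands in for the "huge" one.
object BroadcastSketch {
  // Runs the closure variant and the broadcast variant, returning both results.
  def run(): (Seq[Boolean], Seq[Boolean]) = {
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val array: Array[Int] = Array(2, 4, 6)
      val rdd = sc.parallelize(1 to 6, numSlices = 3)

      // Closure capture: the array is serialized along with each task.
      val viaClosure = rdd.map(i => array.contains(i)).collect().toSeq

      // Broadcast: tasks read the shared copy through .value.
      val broadcasted = sc.broadcast(array)
      val viaBroadcast = rdd.map(i => broadcasted.value.contains(i)).collect().toSeq

      broadcasted.unpersist() // drop cached copies on executors; re-sent if used again
      broadcasted.destroy()   // release all data and metadata; unusable afterwards
      (viaClosure, viaBroadcast)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaClosure, viaBroadcast) = run()
    println(s"same results: ${viaClosure == viaBroadcast}")
  }
}
```

Both variants compute the same result. The difference is only in how the array reaches the executors.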
