Broadcast Variables in Spark


Question

Let's say I have the following code running on a cluster:

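// HashMap is assumed to be the mutable variant.
import scala.collection.mutable.HashMap
import org.apache.spark.storage.StorageLevel

// Body omitted in the original question.
private def modifyDatasetFormat(data: String, mappings: Array[HashMap[String, Int]]): Array[Tuple2[Tuple3[Int, Int, Int], Int]] = {
  ???
}

val map = new HashMap[String, Int]()
map += ("hello" -> 2)

val mappings = new Array[HashMap[String, Int]](1)
mappings(0) = map

val originalDataset = sc.textFile("/home/paourissi/Desktop/MyProject/nursery.1000.withID")
// mappings is captured in the closure and shipped with each task.
val dataset = originalDataset
  .flatMap(data => modifyDatasetFormat(data, mappings))
  .persist(StorageLevel.MEMORY_AND_DISK)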

Do I need to broadcast the value mappings, or is that unnecessary? More generally, when should broadcast variables be used? Is it for efficiency?

Thank you.

Recommended answer

First of all, broadcast variables are designed to be shared throughout a cluster, and at the same time they must be able to fit in memory on a single machine.

Secondly, broadcast variables are immutable, so they cannot be changed later on (if you need a shared value that tasks can update, take a look at accumulators).
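To illustrate the difference, here is a minimal sketch, assuming Spark 2.0 or later (for `longAccumulator`) and a local master. The object name, the helper `isUnknown`, and the sample words are hypothetical and not part of the question. The broadcast value is only read inside the tasks, while the accumulator is only added to.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastVsAccumulatorSketch {
  // Hypothetical helper: true when the word is missing from the known set.
  def isUnknown(word: String, known: Set[String]): Boolean = !known.contains(word)

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("broadcast-vs-accumulator-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Read-only on the executors: tasks call .value and never modify it.
      val known = sc.broadcast(Set("hello", "world"))
      // Add-only from the tasks: the driver reads the total afterwards.
      val unknownCount = sc.longAccumulator("unknownCount")

      val words = sc.parallelize(Seq("hello", "spark", "world", "scala"))
      words.foreach { w =>
        if (isUnknown(w, known.value)) unknownCount.add(1)
      }

      println(s"unknown words: ${unknownCount.value}")
    } finally {
      sc.stop()
    }
  }
}
```

The accumulator is updated inside an action (`foreach`) because Spark only guarantees that updates made in actions are applied once per successful task.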

Efficiency: Inside Spark, the nodes in the cluster distribute a broadcast variable among themselves as quickly and efficiently as possible, each downloading the pieces it can and uploading the pieces it already has. This is much faster than having a single node do all the work and push the data to every other node.

As referenced in the Apache Spark documentation, broadcast variables are a great fit for "static lookup tables".
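Applied to a lookup table like the one in the question, the idea could look roughly like the sketch below. It assumes a local master and comma-separated input; the object name, the helper `encode`, the sample data, and the choice of -1 for unknown tokens are all hypothetical. Without the broadcast, the table would be serialized into the closure of every task; with it, each executor fetches the table once and tasks read it through `.value`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object BroadcastLookupSketch {
  // Hypothetical helper: encode one comma-separated record with a lookup table.
  // Tokens missing from the table map to -1 (an arbitrary choice for this sketch).
  def encode(line: String, lookup: Map[String, Int]): Array[Int] =
    line.split(",").map(token => lookup.getOrElse(token.trim, -1))

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("broadcast-lookup-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Small static table that fits in memory on a single machine.
      val lookup = Map("hello" -> 2, "world" -> 5)
      val lookupBc: Broadcast[Map[String, Int]] = sc.broadcast(lookup)

      val lines = sc.parallelize(Seq("hello,world", "hello,unknown"))
      // Tasks read the table via lookupBc.value instead of capturing lookup itself.
      val encoded = lines.map(line => encode(line, lookupBc.value))

      encoded.collect().foreach(row => println(row.mkString(",")))
    } finally {
      sc.stop()
    }
  }
}
```

For a table as small as the single-entry map in the question, the saving is likely negligible; broadcasting tends to pay off when the table is large or is reused across many tasks and stages.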

You may also like this interesting post by SparkTutorials.
