Spark: passing broadcast variable to executors

Problem Description

I am passing a broadcast variable to all my executors using the following code. The code seems to work, but I don't know if my approach is good enough. Just want to see if anyone has any better suggestions. Thank you very much!

val myRddMap = sc.textFile("input.txt").map(t => myParser.parse(t))
val myHashMapBroadcastVar = sc.broadcast(myRddMap.collect().toMap)

where myRddMap is of type org.apache.spark.rdd.RDD[(String, (String, String))]

Then I have a utility function to which I pass RDDs and variables, like:

val myOutput = myUtiltityFunction.process(myRDD1, myHashMapBroadcastVar)


So is the above code a good way to handle broadcast variables? Or is there a better approach? Thanks!

Solution

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Broadcast variables are actually sent to all nodes, so it doesn't matter whether you use them in a utility function or anywhere else. As far as I can tell, you are doing the right thing; nothing here looks like it would cause poor performance.
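
To make the pattern concrete, here is a minimal, self-contained sketch of the same idea. The process helper and all names below are hypothetical stand-ins for the question's code, not the actual implementation; the point is that tasks read the broadcast map through .value, so Spark ships the map to each executor once instead of serializing a copy into every task closure.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BroadcastExample {
  // Hypothetical stand-in for the question's myUtiltityFunction.process:
  // each task looks up keys in the executor-local copy of the broadcast
  // map via .value.
  def process(rdd: RDD[String],
              lookup: Broadcast[Map[String, (String, String)]]): RDD[String] =
    rdd.map { key =>
      lookup.value.get(key) match {
        case Some((a, b)) => s"$key -> ($a, $b)"
        case None         => s"$key -> no match"
      }
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-example").setMaster("local[*]"))

    // Stands in for myRddMap.collect().toMap from the question.
    val lookupMap = Map("k1" -> ("a1", "b1"), "k2" -> ("a2", "b2"))
    val broadcastVar = sc.broadcast(lookupMap)

    val keys = sc.parallelize(Seq("k1", "k2", "k3"))
    process(keys, broadcastVar).collect().foreach(println)

    sc.stop()
  }
}

Note that .value is called inside the map closure, on the executors: only the lightweight Broadcast handle is captured by the task, while the map itself comes from the per-node cache.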
