Spark: passing broadcast variable to executors
Problem description
I am passing a broadcast variable to all my executors using the following code. The code seems to work, but I don't know if my approach is good enough. Just want to see if anyone has any better suggestions. Thank you very much!
    val myRddMap = sc.textFile("input.txt").map(t => myParser.parse(t))
    val myHashMapBroadcastVar = sc.broadcast(myRddMap.collect().toMap)
where myRddMap is of type org.apache.spark.rdd.RDD[(String, (String, String))].
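For context, here is a minimal, self-contained sketch of that pattern. The parser is a hypothetical stand-in (the question's myParser is not shown), and "input.txt" is assumed to hold tab-separated lines of the form key, value1, value2:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Hypothetical parser: each line "key\tv1\tv2" becomes (key, (v1, v2)).
        def parse(line: String): (String, (String, String)) = {
          val Array(k, v1, v2) = line.split("\t", 3)
          (k, (v1, v2))
        }

        val myRddMap = sc.textFile("input.txt").map(parse)

        // collect() pulls the RDD to the driver, so this is only safe if the
        // lookup table fits in driver memory; broadcast() then caches one
        // read-only copy per executor instead of shipping it with every task.
        val myHashMapBroadcastVar = sc.broadcast(myRddMap.collect().toMap)

        println(myHashMapBroadcastVar.value.size)
        sc.stop()
      }
    }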
Then I have a utility function to which I pass RDDs and broadcast variables, like:
    val myOutput = myUtilityFunction.process(myRDD1, myHashMapBroadcastVar)
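The utility function itself is not shown in the question; a plausible sketch (the signature and lookup logic are assumptions) would accept the Broadcast handle and read it with .value inside the closure, so tasks serialize only the handle rather than the map:

    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    // Hypothetical utility: looks each key up in the broadcast table.
    object myUtilityFunction {
      def process(rdd: RDD[String],
                  lookup: Broadcast[Map[String, (String, String)]]): RDD[(String, String, String)] = {
        rdd.flatMap { key =>
          // .value runs on the executor against its locally cached copy.
          lookup.value.get(key).map { case (v1, v2) => (key, v1, v2) }
        }
      }
    }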
So is the above code a good way to handle broadcast variables? Or is there a better approach? Thanks!
Answer

As the Spark documentation puts it: "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks."
Broadcast variables are actually sent to all nodes, so it doesn't matter whether you use them in a utility function or anywhere else. As far as I can tell, you are doing the right thing; nothing here looks like it would cause poor performance.
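To illustrate why this is fine, the two variants below are functionally equivalent, but the second serializes the whole map into every task instead of caching it once per executor (a sketch, assuming myRDD1 is an RDD[String] and localMap is a plain Map[String, (String, String)] built on the driver):

    // Good: the closure captures only the lightweight Broadcast handle.
    val withBroadcast = myRDD1.map(k => myHashMapBroadcastVar.value.getOrElse(k, ("", "")))

    // Also works, but ships a full copy of localMap with every task.
    val withoutBroadcast = myRDD1.map(k => localMap.getOrElse(k, ("", "")))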