update variables using map function on spark
Question
Here is my code:
val dataRDD = sc.textFile(args(0)).map(line => line.split(" ")).map(x => Array(x(0).toInt, x(1).toInt, x(2).toInt))
var arr = new Array[Int](3)
printArr(arr)
dataRDD.map(x => {arr=x})
printArr(arr)
This code is not working properly. How can I make it work successfully?
Answer
Okay, so operations on RDDs are performed in parallel by different workers (usually on different machines in the cluster), and therefore you cannot pass in this kind of "global" object arr to be updated. Each worker will receive its own copy of arr, which it will update, but the driver will never know.
I'm guessing that what you want to do here is to collect all the arrays from the RDD, which you can do with a simple collect action:
val dataRDD = sc.textFile(args(0)).map(line => line.split(" ")).map(x => Array(x(0).toInt, x(1).toInt, x(2).toInt))
val arr = dataRDD.collect()
Here arr has type Array[Array[Int]]. You can then run through arr with normal array operations.
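Once collected, arr is an ordinary local array on the driver, so plain Scala collection operations apply. A minimal sketch, using made-up sample data in place of the actual RDD contents:

```scala
object CollectedArrayDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the result of dataRDD.collect():
    // each inner array holds the three parsed Ints from one input line.
    val arr: Array[Array[Int]] = Array(
      Array(1, 2, 3),
      Array(4, 5, 6)
    )

    // Normal array operations run on the driver, e.g. summing each row
    val rowSums: Array[Int] = arr.map(_.sum)

    println(rowSums.mkString(", "))  // prints: 6, 15
  }
}
```

Note that collect() brings the entire dataset back to the driver, so this is only appropriate when the data fits in driver memory; for large datasets you would keep the processing on the RDD itself.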