Apache Spark的RDD [Vector]不变性问题 [英] Apache Spark's RDD[Vector] Immutability issue
本文介绍了Apache Spark的RDD [Vector]不变性问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我知道RDD是不可变的,因此不能更改其值,但是我看到以下行为:
I know the RDDs are immutable and therefore their value cannot be changed but I see the following behaviour:
我写了FuzzyCMeans( https://github.com/salexln/FinalProject_FCM )算法的实现现在我正在测试它,所以我运行以下示例:
I wrote an implementation for FuzzyCMeans (https://github.com/salexln/FinalProject_FCM) algorithm and now I'm testing it, so I run the following example:
import org.apache.spark.mllib.clustering.FuzzyCMeans
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("/home/development/myPrjects/R/butterfly/butterfly.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
> parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[2] at map at <console>:31
val numClusters = 2
val numIterations = 20
parsedData.foreach{ point => println(point) }
> [0.0,-8.0]
[-3.0,-2.0]
[-3.0,0.0]
[-3.0,2.0]
[-2.0,-1.0]
[-2.0,0.0]
[-2.0,1.0]
[-1.0,0.0]
[0.0,0.0]
[1.0,0.0]
[2.0,-1.0]
[2.0,0.0]
[2.0,1.0]
[3.0,-2.0]
[3.0,0.0]
[3.0,2.0]
[0.0,8.0]
val clusters = FuzzyCMeans.train(parsedData, numClusters, numIteration
parsedData.foreach{ point => println(point) }
>
[0.0,-0.4803333185624595]
[-0.1811743096972924,-0.12078287313152826]
[-0.06638890786148487,0.0]
[-0.04005925925925929,0.02670617283950619]
[-0.12193263222069807,-0.060966316110349035]
[-0.0512,0.0]
[NaN,NaN]
[-0.049382716049382706,0.0]
[NaN,NaN]
[0.006830134553650707,0.0]
[0.05120000000000002,-0.02560000000000001]
[0.04755220304297078,0.0]
[0.06581619798335057,0.03290809899167529]
[0.12010867103812725,-0.0800724473587515]
[0.10946638900458144,0.0]
[0.14814814814814817,0.09876543209876545]
[0.0,0.49119985188436205]
但是我的方法如何更改不可变RDD?
But how can this be that my method changes the Immutable RDD?
BTW(火车方法的签名)如下:
BTW, the signature of the train method, is the following:
train(数据:RDD [Vector],簇:Int,maxIterations:Int)
train( data: RDD[Vector], clusters: Int, maxIterations: Int)
推荐答案
查看全文