如何建立一个大型分布式[疏] Apache中的星火1.0矩阵? [英] How to build a large distributed [sparse] matrix in Apache Spark 1.0?
问题描述
我有一个RDD这样: byUserHour:org.apache.spark.rdd.RDD [(字符串,字符串,整数)]
我想创建一个稀疏像中值计算的数据的矩阵,平均等。RDD包含ROW_ID,COLUMN_ID和价值。我有一个包含了查找的ROW_ID和COLUMN_ID字符串一个数组。
I have an RDD as such: byUserHour: org.apache.spark.rdd.RDD[(String, String, Int)]
I would like to create a sparse matrix of the data for calculations like median, mean, etc. The RDD contains the row_id, column_id and value. I have two Arrays containing the row_id and column_id strings for lookup.
下面是我的尝试:
import breeze.linalg._
val builder = new CSCMatrix.Builder[Int](rows=BCnUsers.value.toInt,cols=broadcastTimes.value.size)
byUserHour.foreach{x =>
val row = userids.indexOf(x._1)
val col = broadcastTimes.value.indexOf(x._2)
builder.add(row,col,x._3)}
builder.result()
下面是我的错误:
14/06/10 16:39:34 INFO DAGScheduler: Failed to run foreach at <console>:38
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: breeze.linalg.CSCMatrix$Builder$mcI$sp
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
我的数据是相当大的,所以我想这样做如果可能,这种分布式的。任何帮助将是AP preciated。
My dataset is quite large so I would like to do this distributed if possible. Any help would be appreciated.
的最新进展:的
CSCMartix并不意味着以火花工作。然而,有<一个href=\"http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix\"相对=nofollow> RowMatrix 延伸 DistributedMatrix
。 RowMatrix
确实有一个方法, computeColumnSummaryStatistics()
,应该计算一些我要找的统计资料。我知道MLlib是每天越来越多,所以我会看更新,但在此期间,我会尽量让一个 RDD [矢量]
饲料 RowMatrix
。注意到 RowMatrix
是实验性的,并重新presents没有意义的行索引面向行的分布式矩阵。
CSCMartix is not meant to work in Spark. However, there is RowMatrix which extends DistributedMatrix
. RowMatrix
does have a method, computeColumnSummaryStatistics()
, that should compute some of the stats I am looking for. I know MLlib is growing everyday so I will watch for updates, but in the meantime I will try to make an RDD[Vector]
to feed RowMatrix
. Noting that RowMatrix
is experimental and represents a row-oriented distributed Matrix with no meaningful row indices.
推荐答案
与映射有点不同byUserHour开始现在是一个 RDD [(字符串(字符串,整数))]
由于RowMatrix没有行我就ROW_ID groupByKey的preserve秩序。也许将来我会弄清楚如何与一个稀疏矩阵做到这一点。
Starting with mapping a little different byUserHour is now an RDD[(String, (String, Int))]
.
Because RowMatrix does not preserve order of rows I groupByKey on the row_id. Perhaps in the future I will figure out how to do this with a sparse matrix.
val byUser = byUserHour.groupByKey // RDD[(String, Iterable[(String, Int)])]
val times = countHour.map(x => x._1.split("\\+")(1)).distinct.collect.sortWith(_ < _) // Array[String]
val broadcastTimes = sc.broadcast(times) // Broadcast[Array[String]]
val userMaps = byUser.mapValues {
x => x.map{
case(time,cnt) => time -> cnt
}.toMap
} // RDD[(String, scala.collection.immutable.Map[String,Int])]
val rows = userMaps.map {
case(u,ut) => (u.toDouble +: broadcastTimes.value.map(ut.getOrElse(_,0).toDouble))} // RDD[Array[Double]]
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val rowVec = rows.map(x => Vectors.dense(x)) // RDD[org.apache.spark.mllib.linalg.Vector]
import org.apache.spark.mllib.linalg.distributed._
val countMatrix = new RowMatrix(rowVec)
val stats = countMatrix.computeColumnSummaryStatistics()
val meanvec = stats.mean
这篇关于如何建立一个大型分布式[疏] Apache中的星火1.0矩阵?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!