Scale Matrix in Scala/Spark


Question

I have the following list:

id1, column_index1, value1
id2, column_index2, value2
...

which I transformed into an indexed row matrix as follows:

val data_mapped = data.map({ case (id, col, score) => (id, (col, score))})
val data_mapped_grouped = data_mapped.groupByKey
val indexed_rows = data_mapped_grouped.map({ case (id, vals) => IndexedRow(id, Vectors.sparse(nCols.value, vals.toSeq))})
val mat = new IndexedRowMatrix(indexed_rows)

I want to perform some preprocessing on this matrix: subtract the sum of each column from that column, then standardize each column by its variance. I tried to use the built-in standard scaler:

val scaler = new StandardScaler().fit(indexed_rows.map(x => x.features))

but this doesn't seem to be possible with the IndexedRow type.

Thanks for your help!

Recommended answer

As I understand your question, here is what you need to do to fit a StandardScaler on the vectors held in your IndexedRow objects:

import org.apache.spark.mllib.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

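// Placeholder for the input: (row id, column index, value) triples
val data: RDD[(Int, Int, Double)] = ???

// Stand-in for the question's nCols.value (the number of columns)
object nCol {
  val value: Int = ???
}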

val data_mapped: RDD[(Int, (Int, Double))] = 
    data.map({ case (id, col, score) => (id, (col, score)) })
val data_mapped_grouped: RDD[(Int, Iterable[(Int, Double)])] = 
    data_mapped.groupByKey

val indexed_rows: RDD[IndexedRow] = data_mapped_grouped.map {
    case (id, vals) =>
        IndexedRow(id, Vectors.sparse(nCol.value, vals.toSeq))
}

You can get the vectors out of your IndexedRow objects with a simple map:

val vectors: RDD[Vector] = indexed_rows.map { case i: IndexedRow => i.vector }

Now that you have an RDD[Vector], you can fit your scaler on it:

val scaler: StandardScalerModel = new StandardScaler().fit(vectors)
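Fitting only computes the column statistics; the rows still have to be transformed and wrapped back into an IndexedRowMatrix. The following is a minimal sketch of one way to do that, not a definitive implementation. The helper name `scaleRows` and its `center` parameter are hypothetical and not part of Spark. Note that StandardScaler divides each column by its standard deviation rather than its variance, and that it subtracts the column mean (not the sum) only when `withMean` is true. Centering makes the output dense, and some Spark versions reject sparse input when centering, so the sketch converts each row to a dense vector in that case.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): fit a scaler on the row vectors,
// then rebuild the matrix while keeping each row's original index.
def scaleRows(rows: RDD[IndexedRow], center: Boolean = false): IndexedRowMatrix = {
  // withStd = true divides each column by its standard deviation.
  // withMean = true additionally subtracts the column mean.
  val model = new StandardScaler(withMean = center, withStd = true)
    .fit(rows.map(_.vector))

  val scaled = rows.map { row =>
    // Assumption: densify before centering, since centered output is dense
    // and some Spark versions do not accept sparse vectors with withMean.
    val v = if (center) Vectors.dense(row.vector.toArray) else row.vector
    IndexedRow(row.index, model.transform(v))
  }
  new IndexedRowMatrix(scaled)
}
```

With the values defined above, this could be called as `val scaledMat = scaleRows(indexed_rows)`. Passing `center = true` materializes every row as a dense vector, which may be expensive for a wide, very sparse matrix.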

I hope this helps!
