Spark - correlation matrix from file of ratings


Problem Description

I'm pretty new to Scala and Spark and I'm not able to create a correlation matrix from a file of ratings. It's similar to this question but I have sparse data in the matrix form. My data looks like this:

<user-id>, <rating-for-movie-1-or-null>, ... <rating-for-movie-n-or-null>

123, , , 3, , 4.5
456, 1, 2, 3, , 4
...

The code that is most promising so far looks like this:

val corTest = sc.textFile("data/collab_filter_data.txt").map(_.split(","))
Statistics.corr(corTest, "pearson")

(I know the user_ids in there are a defect, but I'm willing to live with that for the moment)

I'm expecting output like this:

1,   .123, .345
.123, 1,   .454
.345, .454, 1

It's a matrix showing how each user is correlated to every other user. Graphically, it would be a correlogram.

It's a total noob problem but I've been fighting with it for a few hours and can't seem to Google my way out of it.

Recommended Answer

I believe this code should accomplish what you want:

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.linalg._
...
val corTest = input.map { line =>
  // Drop the leading user id, then replace blank ratings with 0.0.
  val split = line.split(",").drop(1)
  split.map(elem => if (elem.trim.isEmpty) 0.0 else elem.toDouble)
}.map(arr => Vectors.dense(arr))

val corrMatrix = Statistics.corr(corTest)

Here, we map your input into an array of Strings, drop the user-id element, replace the blank entries with 0.0, and finally create a dense vector from the resulting array. Also, note that Pearson's method is used by default if no method is supplied.
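
Since the question mentions sparse data, one variant worth noting (my own sketch, not part of the original answer) stores only the non-blank ratings in a SparseVector; missing ratings are still treated as 0.0, they just aren't materialized:

import org.apache.spark.mllib.linalg.Vectors

val corTestSparse = input.map { line =>
  val fields = line.split(",").drop(1)
  // Keep only the (index, value) pairs for non-blank ratings.
  val present = fields.zipWithIndex.collect {
    case (f, i) if f.trim.nonEmpty => (i, f.trim.toDouble)
  }
  // Assumes every line has the same number of movie columns.
  Vectors.sparse(fields.length, present)
}

val corrMatrixSparse = Statistics.corr(corTestSparse)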

在带有一些示例的shell中运行时,我看到以下内容:

When run in the shell with some examples, I see the following:

scala> val input = sc.parallelize(Array("123, , , 3, , 4.5", "456, 1, 2, 3, , 4", "789, 4, 2.5, , 0.5, 4", "000, 5, 3.5, , 4.5, "))
input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:16

scala> val corTest = ...
corTest: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[20] at map at <console>:18

scala> val corrMatrix = Statistics.corr(corTest)
...
corrMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0                  0.9037378388935388   -0.9701425001453317  ... (5 total)
0.9037378388935388   1.0                  -0.7844645405527361  ...
-0.9701425001453317  -0.7844645405527361  1.0                  ...
0.7709910794438823   0.7273340668525836   -0.6622661785325219  ...
-0.7513578452729373  -0.7560667258329613  0.6195855517393626   ...
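
The result is a local mllib Matrix on the driver, so individual coefficients can be read directly. A small usage sketch (apply(i, j), numRows, and numCols are standard Matrix members):

println(corrMatrix(0, 1))  // correlation between column 0 and column 1
println(s"${corrMatrix.numRows} x ${corrMatrix.numCols}")  // matrix dimensions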
