Saving users and items features to HDFS in Spark Collaborative filtering RDD


Problem Description

I want to extract the user and item features (latent factors) from the result of collaborative filtering with ALS in Spark. The code I have so far:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})
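// For reference, the input format assumed by the match above is one
// comma-separated user,item,rating triple per line (values here are hypothetical):
//   1,101,5.0
//   1,102,3.0
//   2,101,4.0
// Note: the partial match throws a MatchError on any line that does not
// split into exactly three fields, so the input is assumed to be clean.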

// Build the recommendation model using ALS
// (rank = number of latent factors; 0.01 is the regularization parameter, lambda)
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// extract user latent factors
val users = model.userFeatures

// extract item latent factors
val items = model.productFeatures

// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected

However, what gets written to HDFS is not what I expect. I expected each line to have a tuple (userId, Array_of_doubles). Instead I see the following:

[myname@host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
.
.

It is dumping the array's default toString (essentially an identity hash) instead of the entire array. I did the following to print the desired values:

for (user <- users) {
  val (userId, lf) = user
  val str = "user:" + userId + "\t" + lf.mkString(" ")
  println(str)
}

This does print what I want, but I can't then write it to HDFS this way (it only prints to the console).
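A note on that loop: for (user <- users) on an RDD desugars to foreach, which runs on the executors, so in cluster mode the println output lands in executor stdout rather than the driver console. A minimal sketch of driver-side printing (only sensible while the factor RDD is small enough to collect):

users.collect().foreach { case (userId, lf) =>
  println("user:" + userId + "\t" + lf.mkString(" "))
}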

What should I do to get the complete array written to HDFS properly?

Spark version is 1.2.1.

Solution

@JohnTitusJungao is right, and the following lines do work as expected:

users.saveAsTextFile("myhdfs/outputdirectory/users") 
items.saveAsTextFile("myhdfs/outputdirectory/items")

And this is the reason: userFeatures returns an RDD[(Int, Array[Double])]. The array values show up as the tokens you see in the output, e.g. [D@3c3137b5: [D means "array of double", and what follows the @ is the hex identity hash code produced by Java's default toString for arrays.
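You can reproduce this in a plain Scala REPL (a minimal sketch; the hash code will differ on every run):

val lf = Array(1.0, 2.0, 3.0)
println(lf.toString)      // something like [D@3c3137b5
println(lf.mkString(",")) // 1.0,2.0,3.0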

val users: RDD[(Int, Array[Double])] = model.userFeatures

To solve that, you'll need to turn the array into a string:

val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))

The same goes for items.
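Putting it together, a minimal sketch of the corrected save step (same paths as in the question; the round-trip parser below is an illustrative extra, not part of the original answer):

// stringify the factor arrays, then save as text
val usersAsText = model.userFeatures.mapValues(_.mkString(","))
val itemsAsText = model.productFeatures.mapValues(_.mkString(","))
usersAsText.saveAsTextFile("myhdfs/outputdirectory/users")
itemsAsText.saveAsTextFile("myhdfs/outputdirectory/items")

// each output line now looks like (1,0.1,0.2,...); to read it back:
val usersBack = sc.textFile("myhdfs/outputdirectory/users").map { line =>
  val fields = line.stripPrefix("(").stripSuffix(")").split(",")
  (fields.head.toInt, fields.tail.map(_.toDouble))
}

Alternatively, saveAsObjectFile together with sc.objectFile[(Int, Array[Double])] gives a lossless binary round trip without any string formatting.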
