Saving users and items features to HDFS in Spark Collaborative filtering RDD
I want to extract the user and item features (latent factors) from the result of collaborative filtering with ALS in Spark. The code I have so far:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// extract users latent factors
val users = model.userFeatures
// extract items latent factors
val items = model.productFeatures
// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected
However, what gets written to HDFS is not what I expect. I expected each line to have a tuple (userId, Array_of_doubles). Instead I see the following:
[myname@host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
.
.
It is dumping the hash value of the array instead of the entire array. I did the following to print
the desired values:
for (user <- users) {
val (userId, lf) = user
val str = "user:" + userId + "\t" + lf.mkString(" ")
println(str)
}
This does print what I want but I can't then write to HDFS (this prints on the console).
What should I do to get the complete array written to HDFS properly?
Spark version is 1.2.1.
@JohnTitusJungao is right, and the following lines then work as expected:
users.saveAsTextFile("myhdfs/outputdirectory/users")
items.saveAsTextFile("myhdfs/outputdirectory/items")
Here is the reason: userFeatures returns an RDD[(Int, Array[Double])]. The array values are rendered as the symbols you see in the output, e.g. [D@3c3137b5: [D denotes an array of double, followed by @ and a hex identity code produced by Java's default toString for objects of this type. More on that here.
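That default toString behavior can be reproduced in plain Scala, no Spark needed; a small illustration with made-up factor values:

```scala
// A plain Array[Double] inherits toString from java.lang.Object, so
// printing it yields the JVM type signature "[D" plus an identity hash,
// while mkString joins the actual values.
val factors: Array[Double] = Array(0.1, 0.2, 0.3)
println(factors.toString)       // e.g. [D@3c3137b5 (hash varies per run)
println(factors.mkString(","))  // 0.1,0.2,0.3
```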
val users: RDD[(Int, Array[Double])] = model.userFeatures
To fix this, convert the array to a string:
val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))
The same goes for items.
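Since saveAsTextFile writes each RDD element's toString on its own line, after mapValues every line becomes a readable (id, comma-separated-factors) pair. A local sketch of what one output line would look like, with made-up values standing in for a model's factors:

```scala
// Simulate one (Int, String) element as produced by
// model.userFeatures.mapValues(_.mkString(",")) -- factor values are made up.
val element: (Int, String) = (1, Array(0.1, 0.2, 0.3).mkString(","))
println(element.toString)  // (1,0.1,0.2,0.3)
```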