Saving users and items features to HDFS in Spark Collaborative filtering RDD


Problem Description

I want to extract the user and item features (latent factors) from the result of collaborative filtering with ALS in Spark. The code I have so far:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})
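// For reference, the input format assumed by the match above is one
// comma-separated user,item,rating triple per line (values here are hypothetical):
//   1,101,5.0
//   1,102,3.0
//   2,101,4.0
// Note: the partial match throws a MatchError on any line that does not
// split into exactly three fields, so the input is assumed to be clean.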

// Build the recommendation model using ALS
// (rank = number of latent factors; 0.01 is the regularization parameter, lambda)
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// extract user latent factors
val users = model.userFeatures

// extract item latent factors
val items = model.productFeatures

// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected

However, what gets written to HDFS is not what I expect. I expected each line to have a tuple (userId, Array_of_doubles). Instead I see the following:

[myname@host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
.
.

It is dumping the array's default toString (essentially an identity hash) instead of the entire array. I did the following to print the desired values:

for (user <- users) {
  val (userId, lf) = user
  val str = "user:" + userId + "\t" + lf.mkString(" ")
  println(str)
}

This does print what I want, but I can't then write it to HDFS this way (it only prints to the console).
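A note on that loop: for (user <- users) on an RDD desugars to foreach, which runs on the executors, so in cluster mode the println output lands in executor stdout rather than the driver console. A minimal sketch of driver-side printing (only sensible while the factor RDD is small enough to collect):

users.collect().foreach { case (userId, lf) =>
  println("user:" + userId + "\t" + lf.mkString(" "))
}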

What should I do to get the complete array written to HDFS properly?

Spark version is 1.2.1.

Solution

@JohnTitusJungao is right, and the following lines do work as expected:

users.saveAsTextFile("myhdfs/outputdirectory/users") 
items.saveAsTextFile("myhdfs/outputdirectory/items")

And this is the reason: userFeatures returns an RDD[(Int, Array[Double])]. The array values show up as the tokens you see in the output, e.g. [D@3c3137b5: [D means "array of double", and what follows the @ is the hex identity hash code produced by Java's default toString for arrays.
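You can reproduce this in a plain Scala REPL (a minimal sketch; the hash code will differ on every run):

val lf = Array(1.0, 2.0, 3.0)
println(lf.toString)      // something like [D@3c3137b5
println(lf.mkString(",")) // 1.0,2.0,3.0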

val users: RDD[(Int, Array[Double])] = model.userFeatures

To solve that, you'll need to turn the array into a string:

val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))

The same goes for items.
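Putting it together, a minimal sketch of the corrected save step (same paths as in the question; the round-trip parser below is an illustrative extra, not part of the original answer):

// stringify the factor arrays, then save as text
val usersAsText = model.userFeatures.mapValues(_.mkString(","))
val itemsAsText = model.productFeatures.mapValues(_.mkString(","))
usersAsText.saveAsTextFile("myhdfs/outputdirectory/users")
itemsAsText.saveAsTextFile("myhdfs/outputdirectory/items")

// each output line now looks like (1,0.1,0.2,...); to read it back:
val usersBack = sc.textFile("myhdfs/outputdirectory/users").map { line =>
  val fields = line.stripPrefix("(").stripSuffix(")").split(",")
  (fields.head.toInt, fields.tail.map(_.toDouble))
}

Alternatively, saveAsObjectFile together with sc.objectFile[(Int, Array[Double])] gives a lossless binary round trip without any string formatting.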
