How to use mllib.recommendation if the user ids are strings instead of contiguous integers?
Question
I want to use Spark's mllib.recommendation library to build a prototype recommender system. However, the user data I have looks like this:
AB123XY45678
CD234WZ12345
EF345OOO1234
GH456XY98765
....
If I want to use the mllib.recommendation library, then according to the API of the Rating class, the user ids have to be integers (do they also have to be contiguous?).
It looks like some kind of conversion between the real user ids and the numeric ones used by Spark is needed. But how should I do this?
Recommended answer
Spark doesn't really require a numeric id. It just needs some unique value per user, but for the implementation they picked Int.
You can do a simple back-and-forth transformation for userId:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

case class MyRating(userId: String, product: Int, rating: Double)
val data: RDD[MyRating] = ???
// Assign a unique Long id to each distinct userId
val userIdToInt: RDD[(String, Long)] =
  data.map(_.userId).distinct().zipWithUniqueId()
// Reverse mapping from the generated id back to the original userId
val reverseMapping: RDD[(Long, String)] =
  userIdToInt.map { case (l, r) => (r, l) }
// Depending on data size, this may be too big to keep
// on a single machine
val map: Map[String, Int] =
  userIdToInt.collect().map { case (id, n) => (id, n.toInt) }.toMap
// Transform to MLlib Rating using the collected map.
// Calling userIdToInt.lookup(...) here would not work, because an RDD
// cannot be used inside another RDD's closure. If the mapping is too
// large to collect, join data with userIdToInt on userId instead.
val rating: RDD[Rating] = data.map { r =>
  Rating(map(r.userId), r.product, r.rating)
}
// ... train model
// ... get back to the MyRating userId from the Int id
val someUserId: String = reverseMapping.lookup(123L).head
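The comment in the code above notes that the collected map may be too big for a single machine. A minimal sketch of one alternative, assuming the same MyRating shape and the same userIdToInt pair RDD as in the answer, is to attach the generated id with a join. The object name JoinMappingSketch and the function name toRatings are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD
// On Spark versions before 1.3, pair-RDD operations such as join may also
// need: import org.apache.spark.SparkContext._

object JoinMappingSketch {
  // Same shape as the MyRating class in the answer.
  case class MyRating(userId: String, product: Int, rating: Double)

  // Hypothetical helper: attach the generated id to every rating with a
  // distributed join, so the mapping never has to be collected to the driver.
  def toRatings(data: RDD[MyRating],
                userIdToInt: RDD[(String, Long)]): RDD[Rating] =
    data
      .map(r => (r.userId, r))   // key each rating by its string userId
      .join(userIdToInt)         // yields (userId, (rating, generatedId))
      .map { case (_, (r, id)) =>
        // Assumes the generated id fits in an Int, as the answer does.
        Rating(id.toInt, r.product, r.rating)
      }
}
```

The join shuffles the ratings, so for a small id mapping the collected Map is likely still the simpler choice.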
You can also try 'data.zipWithUniqueId()', but I'm not sure that .toInt would be a safe conversion in that case, even if the dataset is small.
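The doubt about .toInt can be made concrete. The Spark documentation for zipWithUniqueId states that, with n partitions, items in the k-th partition get ids k, n+k, 2*n+k, and so on, so the largest id depends on the size of the largest partition and not only on the total number of items. The sketch below models that scheme in plain Scala, without Spark. The names UniqueIdSketch, uniqueId and toIntChecked are hypothetical and are not part of any Spark API.

```scala
// Plain-Scala model of the id scheme documented for RDD.zipWithUniqueId:
// with n partitions, the i-th item (0-based) of partition k gets id i * n + k.
object UniqueIdSketch {
  def uniqueId(numPartitions: Int, partition: Int, indexInPartition: Long): Long =
    indexInPartition * numPartitions + partition

  // Hypothetical helper: narrow to Int only when the value fits,
  // and fail loudly instead of silently wrapping around.
  def toIntChecked(id: Long): Int = {
    require(id >= 0 && id <= Int.MaxValue, s"id $id does not fit in an Int")
    id.toInt
  }
}
```

Replacing a bare .toInt with a checked conversion of this kind turns a silent overflow into an immediate error. RDD.zipWithIndex is another option when contiguous ids are wanted, at the cost of an extra Spark job when the RDD has more than one partition.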