How to use mllib.recommendation if the user ids are strings instead of contiguous integers?

Question

I want to use Spark's mllib.recommendation library to build a prototype recommender system. However, the user ids in my data look like this:

AB123XY45678
CD234WZ12345
EF345OOO1234
GH456XY98765
....

If I want to use the mllib.recommendation library, then according to the API of the Rating class, the user ids have to be integers (and do they also have to be contiguous?).

It looks like some kind of conversion between the real user ids and the numeric ones used by Spark is needed. How should I do this?

Recommended answer

Spark doesn't really require a numeric id; it just needs to be some unique value, but the implementation uses Int.

You can do a simple back-and-forth transformation for userId:

  import org.apache.spark.mllib.recommendation.Rating
  import org.apache.spark.rdd.RDD

  case class MyRating(userId: String, product: Int, rating: Double)

  val data: RDD[MyRating] = ???

  // Assign a unique Long id to each distinct userId
  val userIdToInt: RDD[(String, Long)] =
    data.map(_.userId).distinct().zipWithUniqueId()

  // Reverse mapping from the generated id back to the original userId
  val reverseMapping: RDD[(Long, String)] =
    userIdToInt.map { case (userId, id) => (id, userId) }

  // Collect the mapping to the driver. Depending on data size, it may be
  // too big to keep on a single machine.
  // Note: .toInt silently wraps around for ids above Int.MaxValue.
  val map: Map[String, Int] =
    userIdToInt.collect().map { case (userId, id) => (userId, id.toInt) }.toMap

  // Transform to MLlib ratings. RDD operations such as lookup cannot be
  // called inside another RDD's closure, so the ids come from the local map.
  val rating: RDD[Rating] = data.map { r =>
    Rating(map(r.userId), r.product, r.rating)
  }

  // ... train model

  // ... get back to the original userId from the generated id
  val someUserId: String = reverseMapping.lookup(123L).head
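
If the id mapping is too big to collect onto a single machine, a join is one possible alternative. The following is only a sketch under assumptions: the helper name toMllibRatings and its userIds parameter are hypothetical, and the generated ids are assumed to have been narrowed to Int already.

```scala
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

case class MyRating(userId: String, product: Int, rating: Double)

// Hypothetical helper (not part of the original answer): attach the
// generated Int id to every rating with a join, so the id mapping never
// has to be collected to the driver.
def toMllibRatings(
    data: RDD[MyRating],
    userIds: RDD[(String, Int)]): RDD[Rating] =
  data
    .map(r => (r.userId, r)) // key each rating by its string user id
    .join(userIds)           // yields (userId, (rating, intId))
    .map { case (_, (r, intId)) => Rating(intId, r.product, r.rating) }
```

With the names used in the answer, this could be called as toMllibRatings(data, userIdToInt.mapValues(_.toInt)). The join costs a shuffle, so for a small mapping the collected map is likely the simpler choice.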

You can also try 'data.zipWithUniqueId()', but I'm not sure that .toInt would be a safe conversion in that case, even if the dataset is small.
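
On the .toInt concern: Spark documents zipWithUniqueId as giving the items of the k-th partition the ids k, n+k, 2n+k, ..., where n is the number of partitions, so for a small dataset the ids should stay small. A checked conversion still costs little. The sketch below is plain Scala with no Spark dependency; the helper names toIntId and uniqueId are hypothetical and only illustrate the idea.

```scala
// Hypothetical helper: narrow a generated Long id to Int, failing loudly
// instead of silently wrapping around.
def toIntId(id: Long): Int = {
  require(id.isValidInt, s"id $id does not fit in an Int")
  id.toInt
}

// The id scheme as documented for zipWithUniqueId: the item at 0-based
// position i of partition k gets id i * n + k, with n partitions in total.
def uniqueId(indexInPartition: Long, partition: Int, numPartitions: Int): Long =
  indexInPartition * numPartitions + partition

// Third item (position 2) of partition 3 out of 8 partitions.
val small: Long = uniqueId(indexInPartition = 2, partition = 3, numPartitions = 8)

// An unchecked .toInt wraps around once the id exceeds Int.MaxValue.
val wrapped: Int = (Int.MaxValue.toLong + 1).toInt
```

Replacing the bare .toInt calls with a checked helper like this makes an out-of-range id fail fast rather than map two users to the same Int.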
