Spark mllib: how to convert string categorical features into int for Rating to accept


Question

I want to build a recommendation application using Spark mllib and the ALS algorithm for collaborative filtering. My data set stores the user and product features as strings, like this:

[{"user":"StringName1", "product":"StringProduct1", "rating":1},
 {"user":"StringName2", "product":"StringProduct2", "rating":2},
 {"user":"StringName1", "product":"StringProduct2", "rating":3},..]

But Rating seems to accept only int values for both the user and the product. Does that mean I have to build a separate dictionary that maps each string to an int? My dataset will contain duplicate entries for both users and products. Is there a built-in solution for this in the mllib library itself?

Thanks in advance for your help!

No, this is not a duplicate, because the answer to that question does not fit my scenario. spark.ml.recommendation.ALS.Rating does not seem to support String values for user or item, and I need that support.

Recommended answer

Let me give it a try. Assume data: RDD[(String, String, Double)]

import org.apache.spark.mllib.recommendation.Rating

val data = sc.parallelize(Array(("StringName1", "StringProduct1", 1.0), ("StringName2", "StringProduct2", 2.0), ("StringName3", "StringProduct3", 3.0)))

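// Build string -> index maps from the distinct user names and products.
// Sorting before zipWithIndex keeps the assigned ids stable across runs.
// collectAsMap brings each map to the driver, so the distinct values must fit in memory.
val names = data.map(_._1).distinct.sortBy(x => x).zipWithIndex.collectAsMap
val products = data.map(_._2).distinct.sortBy(x => x).zipWithIndex.collectAsMap

// Convert to Rating format: zipWithIndex yields Long indices, Rating expects Int ids.
val data_rating = data.map(r => Rating(names(r._1).toInt, products(r._2).toInt, r._3))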
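
To make the mapping step concrete, here is a minimal sketch that mirrors the same distinct / sort / zipWithIndex idea with plain Scala collections, so it runs without a Spark cluster. It uses the three records from the question (which contain duplicates). SimpleRating is a hypothetical stand-in for mllib's Rating, and the reverse maps are an addition that is not in the original snippet; they would be one way to translate ids back into strings after the model returns recommendations.

```scala
// Plain-Scala stand-in for the RDD pipeline (no Spark needed).
// Records taken from the question; note the repeated user and product.
val data = Seq(
  ("StringName1", "StringProduct1", 1.0),
  ("StringName2", "StringProduct2", 2.0),
  ("StringName1", "StringProduct2", 3.0)
)

// distinct + sorted + zipWithIndex mirrors distinct.sortBy(x => x).zipWithIndex.
// On a Seq the index is already an Int; on an RDD it is a Long.
val names: Map[String, Int] = data.map(_._1).distinct.sorted.zipWithIndex.toMap
val products: Map[String, Int] = data.map(_._2).distinct.sorted.zipWithIndex.toMap

// Hypothetical stand-in for org.apache.spark.mllib.recommendation.Rating.
final case class SimpleRating(user: Int, product: Int, rating: Double)

// Every occurrence of the same string receives the same id.
val ratings = data.map { case (u, p, r) => SimpleRating(names(u), products(p), r) }

// Reverse lookups, for turning ids back into the original strings.
val nameById: Map[Int, String] = names.map(_.swap)
val productById: Map[Int, String] = products.map(_.swap)
```

Because the ids come from the distinct values, duplicate users and products in the data set do not need any special handling.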

That should do it. Basically, you create a mapping from each string to a Long index (that is what zipWithIndex returns) and then convert the Long to an Int. The conversion is safe as long as the number of distinct users and products stays below Int.MaxValue.
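
On the question of a built-in solution: the RDD-based mllib API does not appear to offer one, but the DataFrame-based spark.ml package ships StringIndexer, which may be an option if switching APIs is acceptable. The following is only a sketch under assumptions: Spark 2.x or later, a local SparkSession, and the column names userId and productId, which are chosen here for illustration. It feeds spark.ml's ALS rather than mllib's Rating.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().appName("string-ids").master("local[*]").getOrCreate()
import spark.implicits._

// Records taken from the question.
val df = Seq(
  ("StringName1", "StringProduct1", 1.0),
  ("StringName2", "StringProduct2", 2.0),
  ("StringName1", "StringProduct2", 3.0)
).toDF("user", "product", "rating")

// StringIndexer assigns a numeric index to each distinct label
// (by default the most frequent label gets index 0).
val userIndexer = new StringIndexer().setInputCol("user").setOutputCol("userId")
val productIndexer = new StringIndexer().setInputCol("product").setOutputCol("productId")

val withUser = userIndexer.fit(df).transform(df)
val withBoth = productIndexer.fit(withUser).transform(withUser)

// The indexer emits doubles; cast them to int ids before handing them to ALS.
val indexed = withBoth
  .withColumn("userId", col("userId").cast("int"))
  .withColumn("productId", col("productId").cast("int"))

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("productId")
  .setRatingCol("rating")

val model = als.fit(indexed)
```

The trade-off is that the ids are assigned by label frequency rather than by sort order, so the fitted indexer models should be kept around if the ids need to be mapped back to strings later.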
