Spark mllib: how to convert string categorical features into int for Rating to accept
Question
I want to build a recommendation application using Spark MLlib and the ALS algorithm for collaborative filtering. In my data set, the user and product fields are strings, for example:
[{"user":"StringName1", "product":"StringProduct1", "rating":1},
{"user":"StringName2", "product":"StringProduct2", "rating":2},
{"user":"StringName1", "product":"StringProduct2", "rating":3},..]
But the Rating class seems to accept only int values for both the user and the product fields. Does that mean I have to build a separate dictionary that maps each string to an int? My data set will contain duplicate entries for both user and product. Is there a built-in solution for this in the mllib library itself?
Thanks, any help is appreciated!
No, this is not a duplicate, as the answer to that question doesn't seem to fit my scenario. spark.ml.recommendation.ALS.Rating doesn't seem to support String values for user or item. I need this support.
Recommended answer
Let me try. Assume data: RDD[(String, String, Double)] (the rating field of Rating is a Double):
import org.apache.spark.mllib.recommendation.Rating
// Sample data as (user, product, rating); in spark-shell, sc is already defined
val data = sc.parallelize(Array(
  ("StringName1", "StringProduct1", 1.0),
  ("StringName2", "StringProduct2", 2.0),
  ("StringName3", "StringProduct3", 3.0)))
// Get the distinct names and products and assign each one a unique index (a Long)
val names = data.map(_._1).distinct.sortBy(x => x).zipWithIndex.collectAsMap
val products = data.map(_._2).distinct.sortBy(x => x).zipWithIndex.collectAsMap
// Convert to Rating format; Rating requires Int ids, hence .toInt
val data_rating = data.map(r => Rating(names(r._1).toInt, products(r._2).toInt, r._3))
That should do it. Basically, you create a mapping from each string to a Long index (zipWithIndex produces Long values) and then convert the Long to an Int, which is what Rating expects.
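The mapping described above could be extended along the following lines. This is only a sketch under a few assumptions: a local SparkContext is created for illustration (in spark-shell, sc already exists), the maps are broadcast so that each executor receives a single copy, the Long-to-Int conversion uses Math.toIntExact so that an overflow raises an error instead of wrapping around silently, and an inverted map translates Int ids back to the original strings. The names IdMappingSketch, toIntIds, userIds, productIds and productNames are illustrative and not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.Rating

object IdMappingSketch {
  // Hypothetical helper: turn (string, Long index) pairs into a String -> Int map.
  // Math.toIntExact throws ArithmeticException if an index does not fit in an Int.
  def toIntIds(pairs: Array[(String, Long)]): Map[String, Int] =
    pairs.map { case (s, i) => s -> Math.toIntExact(i) }.toMap

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only
    val sc = new SparkContext(
      new SparkConf().setAppName("id-mapping-sketch").setMaster("local[*]"))

    val data = sc.parallelize(Seq(
      ("StringName1", "StringProduct1", 1.0),
      ("StringName2", "StringProduct2", 2.0),
      ("StringName1", "StringProduct2", 3.0)))

    // Distinct strings, sorted so that the assigned ids are reproducible
    val userIds = toIntIds(data.map(_._1).distinct.sortBy(x => x).zipWithIndex.collect())
    val productIds = toIntIds(data.map(_._2).distinct.sortBy(x => x).zipWithIndex.collect())

    // Broadcast the lookup maps instead of capturing them in every task closure
    val userIdsB = sc.broadcast(userIds)
    val productIdsB = sc.broadcast(productIds)

    val ratings = data.map { case (u, p, r) =>
      Rating(userIdsB.value(u), productIdsB.value(p), r)
    }

    // Inverted map: Int id back to the original product string
    val productNames: Map[Int, String] = productIds.map(_.swap)

    ratings.collect().foreach { r =>
      println(s"user=${r.user} product=${productNames(r.product)} rating=${r.rating}")
    }

    sc.stop()
  }
}
```

The inverted map is what makes model output readable again: ALS returns Rating objects that carry Int ids, so a lookup such as productNames(r.product) recovers the product string. Collecting the maps to the driver assumes that the number of distinct users and products fits in driver memory.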
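On the question of a built-in solution: the RDD-based mllib API used above does not, as far as I know, offer one, but the DataFrame-based spark.ml package has StringIndexer, which assigns a numeric index to each distinct string. The following is a sketch rather than a drop-in replacement for the answer above. It assumes a Spark version that provides SparkSession, and the column names userId and productId as well as the object name StringIndexerSketch are illustrative. The index columns are cast to int explicitly, because StringIndexer writes them as doubles.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object StringIndexerSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only
    val spark = SparkSession.builder()
      .appName("string-indexer-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("StringName1", "StringProduct1", 1.0),
      ("StringName2", "StringProduct2", 2.0),
      ("StringName1", "StringProduct2", 3.0)
    ).toDF("user", "product", "rating")

    // Fit one indexer per string column; each distinct string gets its own index
    val userIndexer = new StringIndexer()
      .setInputCol("user").setOutputCol("userId").fit(df)
    val productIndexer = new StringIndexer()
      .setInputCol("product").setOutputCol("productId").fit(df)

    // Apply both indexers, then cast the double-valued indices to int for ALS
    val indexed = productIndexer.transform(userIndexer.transform(df))
      .withColumn("userId", col("userId").cast("int"))
      .withColumn("productId", col("productId").cast("int"))

    indexed.show()

    // Train ALS on the indexed columns; rank and iteration count are arbitrary here
    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("productId")
      .setRatingCol("rating")
      .setRank(2)
      .setMaxIter(5)
    val model = als.fit(indexed)

    // Predictions carry the numeric ids alongside the original string columns
    model.transform(indexed).show()

    spark.stop()
  }
}
```

Because transform keeps the original user and product columns next to the new index columns, the strings remain available for interpreting predictions without a separate reverse lookup.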