How to use long user ID in PySpark ALS


Problem description

I am attempting to use long user/product IDs in the ALS model in PySpark MLlib (1.3.1) and have run into an issue. A simplified version of the code is given here:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local", "test")

# Load and parse the data
d = [ "3661636574,1,1","3661636574,2,2","3661636574,3,3"]
data = sc.parallelize(d)
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(long(l[0]), long(l[1]), float(l[2])) )

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 20
model = ALS.train(ratings, rank, numIterations)

Running this code yields a java.lang.ClassCastException because the code is attempting to convert the longs to integers. Looking through the source code, the ml ALS class in Spark allows for long user/product IDs, but the mllib ALS class forces the use of ints.

Question: Is there a workaround to use long user/product IDs in PySpark ALS?

Answer

This is a known issue (https://issues.apache.org/jira/browse/SPARK-2465), and it is unlikely to be fixed soon, because changing the interface to a long userId would slow down computation.

There are a few workarounds:

  • You can hash the userId to an int with the hash() function. Since a collision just merges a few rows at random, and collisions are rare, this shouldn't really affect the accuracy of your recommender. This is discussed in the first link.

  • You can generate unique int userIds with RDD.zipWithUniqueId(), or with the slower RDD.zipWithIndex(), just like in this thread: How to assign unique contiguous numbers to elements in a Spark RDD
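The two workarounds above can be sketched in plain Python without a running cluster. This is a minimal illustration, not the original poster's code: the names `raw_ids`, `id_to_int`, and `id_map` are made up for the example, and the Spark-specific equivalents are noted in comments.

```python
# MLlib's ALS requires user/product IDs that fit in a signed 32-bit int.
MAX_INT32 = 2**31 - 1

def id_to_int(user_id):
    """Workaround 1: hash an arbitrarily large ID into the int range.
    Collisions are possible but rare; a collision merely merges a few
    rows, which barely affects the recommender's accuracy."""
    return hash(user_id) % MAX_INT32

raw_ids = [3661636574, 3661636575, 98765432101]

hashed = [id_to_int(i) for i in raw_ids]
assert all(0 <= h <= MAX_INT32 for h in hashed)

# Workaround 2: build an explicit long -> int mapping so recommendations
# can be mapped back to the original IDs afterwards. In PySpark,
# ids.distinct().zipWithUniqueId() produces the same kind of
# (original_id, small_int) pairs as an RDD.
id_map = {orig: i for i, orig in enumerate(sorted(set(raw_ids)))}
reverse_map = {i: orig for orig, i in id_map.items()}
assert reverse_map[id_map[3661636574]] == 3661636574
```

With the hashing approach you lose the ability to recover the original ID from a recommendation, so the explicit mapping is preferable when you need to translate results back.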

