ALS model - predicted full_u * v^t * v ratings are very high

Problem Description

I'm predicting ratings in between processes that batch train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?

! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .

from pyspark.mllib.recommendation import Rating

ratingsRDD = sc.textFile('ratings.dat') \
               .map(lambda l: l.split("::")) \
               .map(lambda p: Rating(
                                  user = int(p[0]), 
                                  product = int(p[1]),
                                  rating = float(p[2]), 
                                  )).cache()

from pyspark.mllib.recommendation import ALS

rank = 50
numIterations = 20
lambdaParam = 0.1
model = ALS.train(ratingsRDD, rank, numIterations, lambdaParam)

Then extract the product features ...

import numpy as np

pf = model.productFeatures()

pf_vals = pf.sortByKey().values().collect()
pf_keys = pf.sortByKey().keys().collect()

Vt = np.matrix(np.asarray(pf_vals))

full_u = np.zeros(len(pf_keys))

def set_rating(pf_keys, full_u, key, val):
    try:
        idx = pf_keys.index(key)
        full_u.itemset(idx, val)
    except ValueError:
        # product id not present in the factorization; leave it unrated
        pass

set_rating(pf_keys, full_u, 260, 9)    # Star Wars (1977)
set_rating(pf_keys, full_u, 1,   8)    # Toy Story (1995)
set_rating(pf_keys, full_u, 16,  7)    # Casino (1995)
set_rating(pf_keys, full_u, 25,  8)    # Leaving Las Vegas (1995)
set_rating(pf_keys, full_u, 32,  9)    # Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
set_rating(pf_keys, full_u, 335, 4)    # Flintstones, The (1994)
set_rating(pf_keys, full_u, 379, 3)    # Timecop (1994)
set_rating(pf_keys, full_u, 296, 7)    # Pulp Fiction (1994)
set_rating(pf_keys, full_u, 858, 10)   # Godfather, The (1972)
set_rating(pf_keys, full_u, 50,  8)    # Usual Suspects, The (1995)

recommendations = full_u*Vt*Vt.T

top_ten_ratings = list(np.sort(recommendations)[:,-10:].flat)

print("predicted rating value", top_ten_ratings)

top_ten_recommended_product_ids = np.where(recommendations >= np.sort(recommendations)[:,-10:].min())[1]
top_ten_recommended_product_ids = list(np.array(top_ten_recommended_product_ids))

print("predict rating prod_id", top_ten_recommended_product_ids)

However the predicted ratings seem way too high:

('predicted rating value', [313.67320347694897, 315.30874327316576, 317.1563289268388, 317.45475214423948, 318.19788673744563, 319.93044594688428, 323.92448427140653, 324.12553531632761, 325.41052886977582, 327.12199687047649])
('predict rating prod_id', [49, 287, 309, 558, 744, 802, 1839, 2117, 2698, 3111])

This appears to be incorrect. Any tips appreciated.

Solution

I think the approach mentioned would work if you only care about the ranking of the movies. If you want to get actual ratings, there seems to be something off in terms of dimension/scaling.

The idea here is to guess the latent representation of your new user. Normally, for a user already in the factorization, user i, you have his latent representation u_i (the ith row in model.userFeatures()), and you get his rating for a given movie (movie j) using model.predict, which basically multiplies u_i by the latent representation of the product, v_j. You can get all the predicted ratings at once if you multiply by the whole v: u_i*v.
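For concreteness, here is a minimal sanity-check sketch of that claim, assuming the model trained in the question and a user/product id that exist in the ml-1m data (user 1 and movie 260 are assumed here for illustration):

import numpy as np

# For a user already in the factorization, model.predict(i, j) should match
# the dot product u_i . v_j up to floating-point noise.
i, j = 1, 260
u_i = np.array(model.userFeatures().lookup(i)[0])
v_j = np.array(model.productFeatures().lookup(j)[0])

print(model.predict(i, j))    # MLlib's prediction for (user i, movie j)
print(float(u_i.dot(v_j)))    # the same score computed by hand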

For a new user, you have to guess what his latent representation u_new is from full_u_new. Basically, you want 50 coefficients that represent your new user's affinity towards each of the latent product factors. For simplicity, and since it was enough for my implicit-feedback use case, I simply used the dot product, basically projecting the new user onto the product latent factors: full_u_new*V^t gives you 50 coefficients, coefficient i being how much your new user looks like product latent factor i. It works especially well with implicit feedback. So using the dot product will give you that, but it won't be scaled, which explains the high scores you are seeing. To get usable scores you need a more accurately scaled u_new; I think you could get that using cosine similarity, like they did here: https://github.com/apache/incubator-predictionio/blob/release/0.10.0/examples/scala-parallel-recommendation/custom-query/src/main/scala/ALSAlgorithm.scala
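A minimal sketch of that cosine rescaling, reusing Vt, full_u, and pf_keys from the question's code (this is my own illustration of the idea, not the exact code from the linked Scala example): each product is scored by the cosine between the projected new-user vector and that product's factors, so scores fall in [-1, 1] instead of growing with vector magnitudes.

import numpy as np

# Sketch only: cosine-similarity scoring instead of the raw dot product.
V = np.asarray(Vt)                        # product latent factors, (num_products, rank)
u_new = np.asarray(full_u * Vt).ravel()   # project the new user's ratings onto the factors

# Cosine between u_new and each product's latent vector (guarding against zero norms).
norms = np.linalg.norm(V, axis=1) * np.linalg.norm(u_new)
cosine_scores = V.dot(u_new) / np.where(norms == 0, 1.0, norms)

top_ten = np.argsort(cosine_scores)[-10:][::-1]
print("top ten product ids by cosine score:", [pf_keys[i] for i in top_ten])

Note that cosine scores still only give you a ranking plus a bounded similarity; they are not on the original 1-10 rating scale.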

The approach mentioned by @ScottEdwards2000 in the comments is interesting too, but rather different. You could indeed look for the most similar user(s) in your training set; if there is more than one, you could take the average. I don't think it would do too badly, but it is a really different approach, and you need the full rating matrix (to find the most similar users). Getting one close user should definitely solve the scaling problem. If you manage to make both approaches work, you can compare the results!
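A rough sketch of that nearest-user idea (again my own illustration, not code from the comment), reusing ratingsRDD, model, pf_keys, full_u, and Vt from the question; materializing a dense rating vector per user like this is only reasonable for a small catalog such as MovieLens 1M:

import numpy as np

# Build a dense rating vector per existing user, pick the one closest to the
# new user by cosine similarity, and reuse that user's latent vector.
key_index = {k: i for i, k in enumerate(pf_keys)}

def to_vector(prod_ratings):
    v = np.zeros(len(pf_keys))
    for prod, rating in prod_ratings:
        if prod in key_index:
            v[key_index[prod]] = rating
    return v

user_vectors = (ratingsRDD
                .map(lambda r: (r.user, [(r.product, r.rating)]))
                .reduceByKey(lambda a, b: a + b)
                .mapValues(to_vector))

def cosine(a, b):
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.dot(b) / d) if d else 0.0

closest_uid, _ = (user_vectors
                  .map(lambda kv: (kv[0], cosine(kv[1], full_u)))
                  .max(key=lambda kv: kv[1]))

u_close = np.array(model.userFeatures().lookup(closest_uid)[0])
scores = np.asarray(Vt).dot(u_close)   # predictions already on the training rating scale

Because u_close comes straight out of the factorization, the resulting scores are on the same scale as the training ratings, which is exactly why this route sidesteps the scaling problem.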
