ALS model - predicted full_u * v^t * v ratings are very high


Question

I'm predicting ratings in between the batch processes that train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?

! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .

from pyspark.mllib.recommendation import Rating

ratingsRDD = sc.textFile('ratings.dat') \
               .map(lambda l: l.split("::")) \
               .map(lambda p: Rating(
                                  user = int(p[0]), 
                                  product = int(p[1]),
                                  rating = float(p[2]), 
                                  )).cache()

from pyspark.mllib.recommendation import ALS

rank = 50
numIterations = 20
lambdaParam = 0.1
model = ALS.train(ratingsRDD, rank, numIterations, lambdaParam)

Then I extract the product features:

import numpy as np

pf = model.productFeatures()

pf_vals = pf.sortByKey().values().collect()
pf_keys = pf.sortByKey().keys().collect()

Vt = np.matrix(np.asarray(pf_vals))

full_u = np.zeros(len(pf_keys))

def set_rating(pf_keys, full_u, key, val):
    # Store the rating at the position of product `key`; skip unknown products.
    try:
        idx = pf_keys.index(key)
    except ValueError:
        return
    full_u[idx] = val

set_rating(pf_keys, full_u, 260, 9)    # Star Wars (1977)
set_rating(pf_keys, full_u, 1,   8)    # Toy Story (1995)
set_rating(pf_keys, full_u, 16,  7)    # Casino (1995)
set_rating(pf_keys, full_u, 25,  8)    # Leaving Las Vegas (1995)
set_rating(pf_keys, full_u, 32,  9)    # Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
set_rating(pf_keys, full_u, 335, 4)    # Flintstones, The (1994)
set_rating(pf_keys, full_u, 379, 3)    # Timecop (1994)
set_rating(pf_keys, full_u, 296, 7)    # Pulp Fiction (1994)
set_rating(pf_keys, full_u, 858, 10)   # Godfather, The (1972)
set_rating(pf_keys, full_u, 50,  8)    # Usual Suspects, The (1995)

recommendations = full_u*Vt*Vt.T

top_ten_ratings = list(np.sort(recommendations)[:,-10:].flat)

print("predicted rating value", top_ten_ratings)

top_ten_recommended_product_ids = np.where(recommendations >= np.sort(recommendations)[:,-10:].min())[1]
top_ten_recommended_product_ids = list(np.array(top_ten_recommended_product_ids))

print("predict rating prod_id", top_ten_recommended_product_ids)

However, the predicted ratings seem far too high:

('predicted rating value', [313.67320347694897, 315.30874327316576, 317.1563289268388, 317.45475214423948, 318.19788673744563, 319.93044594688428, 323.92448427140653, 324.12553531632761, 325.41052886977582, 327.12199687047649])
('predict rating prod_id', [49, 287, 309, 558, 744, 802, 1839, 2117, 2698, 3111])

This appears to be incorrect. Any tips are appreciated.

Recommended answer

I think the approach mentioned would work if you only care about the ranking of the movies. If you want to get an actual rating, something seems to be off in terms of dimension/scaling.

The idea here is to guess the latent representation of your new user. Normally, for a user i who is already in the factorization, you have their latent representation u_i (the row for user i in model.userFeatures()), and you get their rating for a given movie j using model.predict, which basically multiplies u_i by the latent representation of the product, v_j. You can get all the predicted ratings at once by multiplying with the whole of v: u_i*v.
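As a minimal sketch of that relationship, the following uses small random matrices in place of the real model.userFeatures() and model.productFeatures() output. The names U, V, i and j are illustrative assumptions, and the factors are synthetic, so the numbers themselves mean nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

rank = 3                          # number of latent factors (50 in the question)
U = rng.normal(size=(4, rank))    # hypothetical user factors, one row per user
V = rng.normal(size=(6, rank))    # hypothetical product factors, one row per product

i, j = 2, 5
single = U[i] @ V[j]              # predicted rating of user i for product j
all_for_user = U[i] @ V.T         # predicted ratings of user i for every product

# The j-th entry of the full vector is the same dot product.
print(single, all_for_user[j])
```

One matrix-vector product therefore gives a known user's score for every product at once, and the scores are on the rating scale because u_i was fitted to that user's ratings.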

For a new user, you have to guess their latent representation u_new from full_u_new. Basically, you want 50 coefficients that represent the new user's affinity towards each of the latent product factors. For simplicity, and since it was enough for my implicit-feedback use case, I simply used the dot product, which projects the new user onto the product latent factors: full_u_new*V^t gives you 50 coefficients, coefficient i being how much your new user looks like product latent factor i. This works especially well with implicit feedback. So the dot product gives you a latent representation, but it is not scaled, which explains the high scores you are seeing. To get usable scores you need a more accurately scaled u_new. I think you could get that using cosine similarity, as was done here: https://github.com/apache/incubator-predictionio/blob/release/0.10.0/examples/scala-parallel-recommendation/custom-query/src/main/scala/ALSAlgorithm.scala
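To make the scaling issue concrete, here is a hedged sketch comparing three ways of scoring a new user on synthetic factors. The plain projection computes V (V_r^T r), where V_r holds the factors of the rated products and r the ratings. A ridge least-squares "fold-in" computes V (V_r^T V_r + reg*I)^-1 (V_r^T r), so the projection is missing exactly that normalizing inverse. The fold-in is a technique swapped in here for illustration, not the method used in the answer, and the helper names project, fold_in and cosine_scores are hypothetical rather than part of Spark MLlib. Spark's own regularization scaling may also differ from the plain reg used below.

```python
import numpy as np

def project(full_u, V):
    """Unscaled projection from the question: full_u * V * V^T."""
    return full_u @ V @ V.T

def fold_in(rated_idx, ratings, V, reg=0.1):
    """Ridge least-squares estimate of u_new from the rated products only."""
    V_r = V[rated_idx]                               # factors of rated products
    gram = V_r.T @ V_r + reg * np.eye(V.shape[1])    # normalizer the projection lacks
    return np.linalg.solve(gram, V_r.T @ ratings)

def cosine_scores(u_new, V):
    """Cosine similarity between u_new and every product vector (ranking only)."""
    denom = np.linalg.norm(V, axis=1) * np.linalg.norm(u_new)
    return (V @ u_new) / denom

rng = np.random.default_rng(0)
rank, n_products = 3, 20
V = rng.normal(size=(n_products, rank))   # synthetic product factors
u_true = rng.normal(size=rank)            # synthetic "true" user factors

rated_idx = np.arange(8)                  # pretend the user rated products 0..7
ratings = V[rated_idx] @ u_true           # noise-free synthetic ratings

full_u = np.zeros(n_products)
full_u[rated_idx] = ratings

unscaled = project(full_u, V)             # what the question computes
u_new = fold_in(rated_idx, ratings, V, reg=1e-8)
scaled = V @ u_new                        # predictions on the rating scale
ranking_only = cosine_scores(u_new, V)    # bounded scores, useful for ordering

print("unscaled:", unscaled[:3])
print("scaled:  ", scaled[:3])
print("true:    ", (V @ u_true)[:3])
```

With noise-free synthetic ratings and a tiny reg, the fold-in recovers u_true almost exactly, while the projection lands on whatever scale V_r^T V_r implies. Cosine scores stay within [-1, 1], so they are suitable for ranking but are not ratings either.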

The approach mentioned by @ScottEdwards2000 in the comments is interesting too, but rather different. You could indeed look for the most similar user(s) in your training set, and if there is more than one, take the average of their ratings. I don't think it would do too badly, but it is a really different approach, and you need the full rating matrix to find the most similar user(s). Using one close user should definitely solve the scaling problem. If you manage to make both approaches work, you could compare the results!
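A rough sketch of that neighbour-based alternative, on a toy dense rating matrix, might look like this. The matrix R, the helper names most_similar_users and average_ratings, and the convention that 0 means "unrated" are all assumptions made for illustration. A real MovieLens matrix would be sparse and far larger.

```python
import numpy as np

def most_similar_users(R, full_u, k=2):
    """Indices of the k training users whose rating rows are closest to full_u (cosine)."""
    norms = np.linalg.norm(R, axis=1) * np.linalg.norm(full_u)
    sims = (R @ full_u) / np.where(norms == 0, 1.0, norms)
    return np.argsort(sims)[::-1][:k]

def average_ratings(R, user_idx):
    """Per-product mean of the neighbours' ratings, ignoring unrated (zero) entries."""
    block = R[user_idx]
    counts = (block > 0).sum(axis=0)
    sums = block.sum(axis=0)
    return np.divide(sums, counts, out=np.zeros_like(sums, dtype=float), where=counts > 0)

# Toy training matrix: rows are users, columns are products, 0 = unrated.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 3, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

full_u = np.array([5, 5, 0, 0], dtype=float)   # new user's known ratings

neighbours = most_similar_users(R, full_u, k=2)
predicted = average_ratings(R, neighbours)

print("neighbours:", neighbours)
print("predicted: ", predicted)
```

Because the predictions are averages of ratings that real users gave, they stay on the original rating scale by construction, which is why this route sidesteps the scaling problem.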
