Splitting in Recommender System


Problem Description

I have a Spark DataFrame with UserID, ItemID, and Rating columns. I am building a recommender system.

The data looks like this:

originalDF.show(5)
+----+----+------+
|user|item|rating|
+----+----+------+
| 353|   0|     1|
| 353|   1|     1|
| 353|   2|     1|
| 354|   3|     1|
| 354|   4|     1|
+----+----+------+

It has 56K unique users and 8.5K unique items.

So each UserID has a record (row) for each item it rated, together with the corresponding rating; in other words, there are multiple records per user ID.

Now I split this into train, validation, and test sets with a random 0.6/0.2/0.2 split, so roughly 60% of the records go to training, 20% to validation, and the remaining 20% to test:

random_split = originalDF.randomSplit(split_perc, seed=20)

return random_split[0], random_split[1], random_split[2]
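
For context, these two lines appear to live inside the train_test_split helper that is called below; a minimal sketch of what that helper presumably looks like (the function name and signature are inferred from its usage, not shown in the original post):

def train_test_split(df, split_perc, seed=20):
    # Randomly split the rows of df according to the given weights,
    # e.g. [0.6, 0.2, 0.2] for train/validation/test.
    random_split = df.randomSplit(split_perc, seed=seed)
    return random_split[0], random_split[1], random_split[2]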

This leaves me with the following dataset counts:

train, validation, test = train_test_split(split_sdf, [0.6, 0.2, 0.2])

print("Training size is {}".format(train.count()))
print("Validation size is {}".format(validation.count()))
print("Test size is {}".format(test.count()))
print("Original Dataset Size is {}".format(split_sdf.count()))

Training size is 179950
Validation size is 59828
Test size is 60223
Original Dataset Size is 300001

Now I train Spark's pyspark.ml.ALS algorithm on the training data:

from pyspark.ml.recommendation import ALS

als = ALS(rank=120, maxIter=15, regParam=0.01, implicitPrefs=True)
model = als.fit(train)

When I check userFactors and itemFactors on the model object, I get this:

itemF = model.itemFactors
itemF.toPandas().shape
# (7686, 2)

userF = model.userFactors
userF.toPandas().shape
# (47176, 2)

Which means it is only giving me factor matrices covering the number of unique users and items present in the training data.
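
This coverage gap is easy to confirm; a small, illustrative check (the left_anti join and variable names are my own, not from the original post) would be something like:

# Users that appear in the original data but not in the training split
missing_users = originalDF.select("user").distinct() \
    .join(train.select("user").distinct(), on="user", how="left_anti")
print("Users missing from training: {}".format(missing_users.count()))

# Same idea for items
missing_items = originalDF.select("item").distinct() \
    .join(train.select("item").distinct(), on="item", how="left_anti")
print("Items missing from training: {}".format(missing_items.count()))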

Now how do I get predictions for all items for each user?

If I do

prediction = model.transform(originalDF)

where originalDF is the whole dataset that was split into train, validation, and test, would that give predictions for all items for each user?
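
As far as I understand, transform only scores the (user, item) pairs that are actually present in the DataFrame you pass it, so it would not produce the full user x item grid. A rough sketch of how one could score every combination instead (the crossJoin approach and column names are assumptions based on the data shown above, not code from the post):

# Build every (user, item) combination, then score it with the trained model
all_pairs = originalDF.select("user").distinct() \
    .crossJoin(originalDF.select("item").distinct())
full_predictions = model.transform(all_pairs)
full_predictions.show(5)

Note that users or items the model never saw during training will still come back with NaN predictions (or can be dropped via coldStartStrategy="drop" in newer Spark versions), which is exactly the issue raised in the answer below.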

My question is: if my dataset has 56K users x 8.5K items, then I want the prediction matrix for the full 56K x 8.5K, not just the 47K x 7.6K covered by the training data.

What am I doing wrong here? I understand the model only works with the 47K x 7.6K training data instead of the original 56K x 8.5K ratings data. So am I splitting the data into train and validation wrong?

I know that for a recommender system one should randomly mask some of the ratings for each user, train on the remaining ratings, and test on the masked values. I did the same here, since each record for a user is a rating for a different item: when we split randomly we are essentially masking some of a user's ratings and not using them for training.

Please advise.

In a typical recommender system with a user x item matrix (56K users x 8.5K items),

we basically mask (set to 0) some random item ratings for each user. The whole matrix is then passed to the recommender algorithm, which factorizes it into the product of two factor matrices.

However, in Spark we don't use a user x item matrix. Instead of having 8.5K item columns, each item's rating becomes an individual row for each user.

So masking (setting some item ratings to 0) in the original user-item matrix is the same as not using some random rows for each user in the Spark DataFrame. Right?

Here is one way I found (and also used) to split the data into train and validation:

# Weighted random split into train/validation/test (the weights are normalized)
training_RDD, validation_RDD, test_RDD = small_ratings_data.randomSplit([6, 2, 2], seed=0)
# Keep only the (user, item) pairs for scoring the held-out sets
validation_for_predict_RDD = validation_RDD.map(lambda x: (x[0], x[1]))
test_for_predict_RDD = test_RDD.map(lambda x: (x[0], x[1]))

I used a similar randomSplit approach here too, so I am not sure what is wrong.

I can understand that since the training data does not contain all users and items, the factor matrices will also only cover that many users and items. So how do I overcome that? In the end I basically need a matrix of predictions for all users and all items.
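
As a side note, if a full dense matrix is not strictly required, newer Spark versions (2.2+) expose top-k recommendation helpers directly on the fitted model; a small example (the value of k is arbitrary):

# Top 10 recommended items per user, and top 10 users per item,
# but only for users/items the model saw during training.
user_recs = model.recommendForAllUsers(10)
item_recs = model.recommendForAllItems(10)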

Answer

All of the ids:

  • users
  • products

for which you want predictions have to be present in the training set. Using a random split is not a method that can ensure this (it is not equivalent to data masking).
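
One way to get closer to the masking the answer describes is a per-user hold-out: drop a fraction of each user's ratings instead of a fraction of all rows, so every user is guaranteed to appear in the training set. A minimal sketch under that assumption (the Window-based approach and column names are illustrative, not part of the answer):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Give each rating a random relative position within its user's ratings
w = Window.partitionBy("user").orderBy(F.rand(seed=20))
ranked = originalDF.withColumn("pct", F.percent_rank().over(w))

# Keep roughly 80% of each user's ratings for training and hold out the rest,
# so every user appears in the training set.
train = ranked.filter(F.col("pct") <= 0.8).drop("pct")
test = ranked.filter(F.col("pct") > 0.8).drop("pct")

Note that this guarantees coverage of users but not of every item: an item rated by only a few users can still end up entirely in the hold-out set.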
