Building a Collaborative Filtering / Recommendation System


Question

I'm in the process of designing a website built around the concept of recommending various items to users based on their tastes (i.e. items they've rated, items added to their favorites list, etc.). Some examples of this are Amazon, Movielens, and Netflix.

Now, my problem is that I'm not sure where to start with the mathematical part of this system. I'm willing to learn the math that's required; I just don't know what type of math that is.

I've looked at a few of the publications over at Grouplens.org, specifically "Towards a Scalable kNN CF Algorithm: Exploring Effective Applications of Clustering." (pdf) I understand everything reasonably well until page 5, "Prediction Generation".

P.S. I'm not exactly looking for an explanation of what's going on, though that might be helpful; I'm more interested in the math I need to know, so that I can understand what's going on.

Recommended answer

Let me explain the procedure that the authors introduced (as I understood it):

Input:

  • Training data: users, items, and the users' ratings of these items (not every user has necessarily rated every item)
  • Target user: a new user with some ratings of some items
  • Target item: an item not rated by the target user, for which we would like to predict a rating

Output:

  • The predicted rating of the target item for the target user

This can be repeated for a bunch of items, and then we return the top-N items (those with the highest predicted ratings).
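As a rough illustration of that last step, here is a minimal sketch that picks the top-N items once predictions exist. The `predicted` dictionary and the item names are made up for the example; they don't come from the paper.

```python
import heapq

def top_n_items(predicted, n):
    """Return the n item ids with the highest predicted rating, best first."""
    # Iterating a dict yields its keys; predicted.get supplies the rating used for ranking.
    return heapq.nlargest(n, predicted, key=predicted.get)

# Hypothetical predicted ratings for items the target user has not rated yet.
predicted = {"item_a": 3.2, "item_b": 4.7, "item_c": 4.1, "item_d": 2.5}
top_two = top_n_items(predicted, 2)
print(top_two)
```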

Procedure:
The algorithm is very similar to the naive kNN method (search all the training data to find users whose ratings are similar to the target user's, then combine their ratings to give a prediction [voting]).
This simple method does not scale very well as the number of users/items increases.
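To make the naive method concrete, here is a small sketch of it. Everything in it is my own assumption for illustration: the nested-dict data layout, the function names, the choice to ignore users with non-positive similarity, and the toy ratings. Note the loop over every training user, which is exactly the part that scales poorly.

```python
from math import sqrt

def similarity(a, b):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def naive_knn_predict(train, target, item, k):
    """Scan ALL training users, keep the k most similar ones who rated `item`,
    and return the similarity-weighted average of their ratings (None if no neighbor)."""
    scored = []
    for ratings in train.values():
        if item in ratings:
            s = similarity(target, ratings)
            if s > 0:  # assumption: only positively correlated users vote
                scored.append((s, ratings[item]))
    scored.sort(reverse=True)
    top = scored[:k]
    total = sum(s for s, _ in top)
    if total == 0:
        return None
    return sum(s * r for s, r in top) / total

# Toy data (hypothetical): user -> {item: rating}
train = {
    "u1": {"a": 5, "b": 3, "c": 4, "d": 4},
    "u2": {"a": 1, "b": 5, "c": 2, "d": 1},
    "u3": {"a": 4, "b": 2, "c": 5, "d": 5},
}
target = {"a": 5, "b": 2, "c": 4}
prediction = naive_knn_predict(train, target, "d", k=2)
print(prediction)
```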

The proposed algorithm first clusters the training users into K groups (groups of people who rated items similarly), where K << N (N is the total number of users).
Then we scan those clusters to find which ones the target user is closest to (instead of looking at all the training users).
Finally, we pick the l closest of those clusters and make our prediction as an average weighted by how close the target user is to each of those l clusters.
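Here is a rough sketch of the clustering step and the "find the closest cluster" step. Please treat it as an illustration only: it uses plain k-means with Euclidean distance rather than what the paper uses, it assumes every user is a dense rating vector (no missing ratings), and the initialization (first k users) and toy data are my own choices.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length rating vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))

def kmeans(points, k, iterations=10):
    """Standard k-means on dense vectors; returns (centroids, labels)."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each user joins the closest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid becomes the mean rating vector of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Toy data (hypothetical): one row per training user, one column per item.
users = [[5, 4, 1, 1], [1, 1, 5, 4], [4, 5, 1, 2], [1, 2, 4, 5]]
centroids, labels = kmeans(users, k=2)

# At prediction time we compare the target user to K centroids instead of N users.
target = [5, 5, 1, 1]
closest_cluster = nearest(target, centroids)
print(labels, centroids, closest_cluster)
```

The centroid of a cluster then acts as a single "surrogate user" whose ratings stand in for the whole group.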

Note that the similarity measure used is the correlation coefficient, and the clustering algorithm is the bisecting K-Means algorithm. We could simply use standard k-means instead, and we could use other similarity metrics as well, such as Euclidean distance or cosine distance.
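For reference, the two alternative measures mentioned here are easy to write down. This is a minimal sketch on plain Python lists; the function names are mine, and returning 0.0 for a zero vector is a convention I picked, not something from the paper.

```python
from math import sqrt

def euclidean_distance(x, y):
    """Straight-line distance between two rating vectors (smaller = more similar)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two rating vectors (larger = more similar)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def cosine_distance(x, y):
    """Cosine distance is commonly defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(x, y)

print(euclidean_distance([0, 0], [3, 4]))
print(cosine_similarity([1, 2], [2, 4]))
```

Keep in mind that a distance and a similarity point in opposite directions, so a weighted average needs the similarity form (or some inverse of the distance) as its weight.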

The first formula on page 5 is the definition of the correlation:

corr(x,y) = mean((x - mean(x)) * (y - mean(y))) / (std(x) * std(y))
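If it helps to see the correlation as code, here is a minimal sketch using only the standard library. It assumes x and y are equal-length lists holding two users' ratings of the same items, and it uses the population standard deviation; returning 0.0 when a user rated everything identically is my own convention.

```python
from statistics import fmean, pstdev

def corr(x, y):
    """Pearson correlation: mean of the products of deviations,
    divided by the product of the (population) standard deviations."""
    mean_x, mean_y = fmean(x), fmean(y)
    covariance = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    spread = pstdev(x) * pstdev(y)
    if spread == 0:
        return 0.0  # undefined when one user gave the same rating to everything
    return covariance / spread

print(corr([1, 2, 3], [2, 4, 6]))  # ratings that rise together
print(corr([1, 2, 3], [3, 2, 1]))  # ratings that move in opposite directions
```

The result always lies between -1 (opposite tastes) and 1 (identical tastes up to scale and offset).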

The second formula is basically a weighted average:

predRating = sum_i(rating_i * corr(target,user_i)) / sum_i(corr(target,user_i))
               where i loops over the selected top-l clusters
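The weighted average can be sketched like this. The (rating, similarity) pairs are invented numbers, and the sketch assumes all similarities are positive, since the plain sum in the denominator misbehaves otherwise.

```python
def predict_rating(neighbors):
    """Weighted average of ratings, where each weight is the similarity
    between the target user and that neighbor (user or cluster).

    neighbors: list of (rating, similarity) pairs.
    Returns None when the weights sum to zero.
    """
    numerator = sum(rating * sim for rating, sim in neighbors)
    denominator = sum(sim for _, sim in neighbors)
    if denominator == 0:
        return None
    return numerator / denominator

# Hypothetical top-l clusters: (cluster's rating for the target item, similarity to target)
selected = [(4.0, 0.9), (3.0, 0.6), (5.0, 0.5)]
predicted = predict_rating(selected)
print(predicted)
```

Neighbors with a higher similarity pull the prediction toward their own rating, which is all the formula is saying.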

Hope this clarifies things a bit :)
