Building a Collaborative Filtering / Recommendation System

Question

I'm designing a website built around recommending items to users based on their tastes (items they've rated, items they've added to their favorites list, and so on). Examples of this include Amazon, MovieLens, and Netflix.

My problem is that I'm not sure where to start with the mathematical part of this system. I'm willing to learn the math that's required; I just don't know what type of math that is.

I've looked at a few of the publications at Grouplens.org, specifically "Towards a Scalable kNN CF Algorithm: Exploring Effective Applications of Clustering" (pdf). I follow everything until page 5, "Prediction Generation".

P.S. I'm not exactly looking for an explanation of what's going on, though that might be helpful. I'm more interested in the math I need to know, so that I can understand what's going on myself.

Recommended answer

Let me explain the procedure the authors introduce (as I understand it):

Input:

  • Training data: users, items, and the users' ratings of those items (not every user has necessarily rated every item)
  • Target user: a new user with ratings for some items
  • Target item: an item the target user has not rated, for which we would like to predict a rating

Output:

  • Predicted rating of the target user for the target item

This can be repeated for many items, after which we return the top-N items (those with the highest predicted ratings).
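As a small illustration of the top-N step, here is a minimal Python sketch. It assumes the predicted ratings have already been computed and stored in a dictionary; the item ids, the ratings, and the function name `top_n_items` are all made up for the example.

```python
import heapq


def top_n_items(predicted, n):
    """Return the n item ids with the highest predicted rating.

    `predicted` maps item id -> predicted rating (hypothetical data).
    """
    return heapq.nlargest(n, predicted, key=predicted.get)


# Hypothetical predictions for one target user.
predicted = {"item_a": 3.2, "item_b": 4.7, "item_c": 4.1, "item_d": 2.5}
print(top_n_items(predicted, 2))  # ['item_b', 'item_c']
```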

Procedure:

The algorithm is very similar to the naive kNN method: search all the training data for users whose ratings are similar to the target user's, then combine their ratings to produce a prediction (voting).
This simple method does not scale well as the number of users and items increases.
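The naive method can be sketched roughly as follows. This is only an illustration under my own assumptions, not the paper's code: ratings are stored as nested dictionaries, similarity is the correlation over co-rated items, and only neighbours with positive similarity get a vote. The names `pearson` and `knn_predict` and the sample data are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Correlation between two users over the items both have rated."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    xs = [a[item] for item in common]
    ys = [b[item] for item in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)) * sqrt(sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


def knn_predict(train, target, item, k):
    """Predict target's rating of `item` from the k most similar users."""
    # Score every training user who has rated the item (the full scan
    # is what makes this approach slow for large data sets).
    scored = [(pearson(target, r), r[item]) for r in train.values() if item in r]
    # Assumption: keep only positively correlated neighbours.
    neighbours = sorted((s for s in scored if s[0] > 0), reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour
    return sum(sim * rating for sim, rating in neighbours) / total


train = {
    "u1": {"a": 5, "b": 3, "c": 4, "d": 4},
    "u2": {"a": 1, "b": 5, "c": 2, "d": 1},
    "u3": {"a": 4, "b": 2, "c": 5, "d": 5},
}
target = {"a": 5, "b": 2, "c": 4}
print(knn_predict(train, target, "d", k=2))
```

Every prediction touches every training user, which is the scaling problem the clustering step is meant to avoid.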

The proposed algorithm first clusters the training users into K groups (groups of people who rated items similarly), where K << N (N is the total number of users).
Then we scan those clusters to find the ones the target user is closest to (instead of looking at all the training users).
Finally, we pick the l closest clusters and make our prediction as an average weighted by the distance to those l clusters.
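To make the clustering idea concrete, here is a deliberately simplified Python sketch. It swaps in plain k-means with squared Euclidean distance rather than the paper's method, assumes every user has rated every item (dense rating vectors), and initialises the centroids naively with the first k rows. All names and data are hypothetical.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=10):
    """Plain k-means; returns the k centroids."""
    centroids = [list(r) for r in rows[:k]]  # naive init (assumption)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda c: sq_dist(r, centroids[c]))
            groups[nearest].append(r)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_vector(g) if g else centroids[c]
                     for c, g in enumerate(groups)]
    return centroids


def closest_clusters(centroids, target, l):
    """Indices of the l centroids nearest to the target user."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(target, centroids[c]))
    return order[:l]


# Six training users, four items; K = 2 is much smaller than N in practice.
rows = [[5, 4, 1, 1], [1, 1, 5, 4], [4, 5, 1, 2],
        [1, 2, 4, 5], [5, 5, 2, 1], [2, 1, 5, 5]]
centroids = kmeans(rows, k=2)
target = [5, 4, 2, 1]
print(closest_clusters(centroids, target, l=1))
```

The target user is now compared against K centroids instead of N users, which is where the speed-up comes from.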

Note that the similarity measure used is the correlation coefficient, and the clustering algorithm is bisecting K-Means. We could simply use standard k-means instead, and other similarity metrics such as Euclidean distance or cosine distance would work as well.
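For reference, the two alternative metrics mentioned here can be written in a few lines of Python. This is a minimal sketch for dense, equal-length rating vectors; the function names are my own.

```python
from math import sqrt


def euclidean_distance(x, y):
    """Straight-line distance; smaller means more similar."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; larger means more similar."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


print(euclidean_distance([0, 0], [3, 4]))  # 5.0
print(cosine_similarity([5, 3, 4], [4, 2, 5]))
```

Keep in mind that a distance and a similarity point in opposite directions, so "closest" means the minimum of one and the maximum of the other.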

The first formula on page 5 is the definition of the correlation:

corr(x,y) = mean((x - mean(x)) * (y - mean(y))) / (std(x) * std(y))
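Written out in Python, the correlation looks roughly like this. It is a sketch for two equal-length lists of ratings; returning 0.0 when one of the vectors has no variance is my own convention, not something taken from the paper.

```python
from math import sqrt


def corr(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sum of products of deviations (n times the covariance).
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # Square roots of the summed squared deviations (sqrt(n) times std).
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat constant vectors as uncorrelated
    return cov / (sx * sy)


print(corr([1, 2, 3], [2, 4, 6]))  # perfectly correlated, close to 1
print(corr([1, 2, 3], [3, 2, 1]))  # perfectly anti-correlated, close to -1
```

The factors of n cancel between numerator and denominator, so sums can be used in place of means and standard deviations.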

The second formula is basically a weighted average:

predRating = sum_i(rating_i * corr(target, user_i)) / sum_i(corr(target, user_i))
               where i loops over the selected top-l clusters
               and rating_i is cluster i's rating of the target item
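A quick worked example of that weighted average in Python. The (rating, correlation) pairs are invented for illustration, and `predict_rating` is a hypothetical helper name.

```python
def predict_rating(neighbours):
    """Weighted average of ratings, weighted by correlation.

    `neighbours` is a list of (rating_i, corr_i) pairs, one per
    selected cluster.
    """
    total = sum(weight for _, weight in neighbours)
    if total == 0:
        raise ValueError("weights sum to zero; no prediction possible")
    return sum(rating * weight for rating, weight in neighbours) / total


# (3.6 + 3.0 + 0.2) / 1.6 = 4.25
print(round(predict_rating([(4.0, 0.9), (5.0, 0.6), (2.0, 0.1)]), 2))  # 4.25
```

Highly correlated clusters dominate the result. If negative correlations are allowed, the denominator can get close to zero, so a common workaround is to sum absolute values in the denominator instead.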

Hope this clarifies things a little bit :)
