Utilizing multiple, weighted data models for a Mahout recommender

Question

I have a boolean preference recommender based on user similarity. My data set essentially contains user-item relations in which each ItemId is an article the user has decided to read. I'd like to add a second data model in which each ItemId is a subscription to a particular topic.

The only way I can imagine doing this is by merging the two together, offsetting the subscription IDs so that they don't collide with the article IDs. For weighting I considered dropping the boolean preference setup and introducing preference scores, where the articles subset has a preference score of 1 (for example) and the subscriptions subset has a preference score of 2.
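The merge described above could be sketched roughly as follows. This is only an illustration, not Mahout code: the offset value, the weights, and the function names are all assumptions made for the example, and the output is formatted as plain `userID,itemID,value` lines.

```python
# Hypothetical sketch: merge article reads and topic subscriptions into one
# list of (user_id, item_id, weight) triples. The offset and both weights are
# illustrative assumptions, not values taken from the question.

SUBSCRIPTION_ID_OFFSET = 1_000_000_000  # assumed to exceed every article ID
ARTICLE_WEIGHT = 1.0
SUBSCRIPTION_WEIGHT = 2.0


def merge_models(article_reads, subscriptions):
    """Return (user_id, item_id, weight) triples with disjoint item ID ranges."""
    merged = []
    for user_id, article_id in article_reads:
        # Guard against an article ID that would land in the subscription range.
        if article_id >= SUBSCRIPTION_ID_OFFSET:
            raise ValueError(f"article ID {article_id} collides with the offset range")
        merged.append((user_id, article_id, ARTICLE_WEIGHT))
    for user_id, topic_id in subscriptions:
        # Shift subscription IDs so they cannot collide with article IDs.
        merged.append((user_id, topic_id + SUBSCRIPTION_ID_OFFSET, SUBSCRIPTION_WEIGHT))
    return merged


def to_csv_lines(triples):
    """Format triples as userID,itemID,value lines."""
    return [f"{user_id},{item_id},{weight}" for user_id, item_id, weight in triples]
```

The explicit collision check is there because a fixed offset only works while every article ID stays below it.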

I'm not sure if this would work, however, because a preference score isn't exactly analogous to the sort of weighting I'm after; preference scores probably carry some notion that a lower score represents dissatisfaction.

I have to imagine there's a better way to do this or at least that there are tweaks to my plan which would make it work more along the lines I desire.

Recommended answer

I think you're thinking of it in the right way. Yes, you want a bit more expressiveness than a simple exists/doesn't exist for subscriptions and articles, since they mean somewhat different things. I would suggest picking weights that reflect their relative frequency. For example, if users have read 100,000 articles over all time and made 10,000 subscriptions, then you might pick a subscription weight of 10 and a read weight of 1.
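As a rough sketch of that arithmetic, the weight of each event type can be taken as the ratio of the most common type's count to its own count, so rarer events weigh more. The function name and the dictionary layout are assumptions made for this example.

```python
# Hypothetical helper: derive per-event-type weights from relative frequency.
# The most common event type gets weight 1.0; rarer types get larger weights.

def relative_frequency_weights(event_counts):
    """Map each event type to (count of most common type) / (its own count)."""
    if any(count <= 0 for count in event_counts.values()):
        raise ValueError("every event type needs a positive count")
    most_common = max(event_counts.values())
    return {kind: most_common / count for kind, count in event_counts.items()}


# The figures from the answer: 100,000 reads and 10,000 subscriptions.
weights = relative_frequency_weights({"read": 100_000, "subscription": 10_000})
# weights == {"read": 1.0, "subscription": 10.0}
```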

This doesn't quite work if you treat those values as preference scores, for a number of reasons. It works better if you use an approach that treats them as what they are: linear weights.

I would point you to the ALS-WR algorithm, which is specifically designed for this type of input. For example, see the paper "Collaborative Filtering for Implicit Feedback Datasets".
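For orientation, the model in that paper (by Hu, Koren, and Volinsky) can be summarized as follows. Here $r_{ui}$ is the raw weight for user $u$ and item $i$ (for instance 1 for a read and 10 for a subscription, which is an assumption carried over from the example above), $p_{ui}$ is a binary preference, $c_{ui}$ is the confidence in that preference, and $\alpha$ and $\lambda$ are tunable constants:

```latex
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}
\qquad
c_{ui} = 1 + \alpha\, r_{ui}

\min_{x_*,\, y_*} \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr)
```

The weight never acts as a rating to be predicted. It only scales how strongly each observed interaction counts in the squared error, which is the sense in which the values behave as linear weights.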

This is implemented in Mahout as ParallelALSFactorizationJob on Hadoop. It works nicely, though it requires Hadoop. (I can't take credit for that, though I did write most of the recommender code in Mahout.)

Advertisement: I'm working on commercializing a "next generation" system, evolved from my work in Mahout, as Myrrix. It is an implementation of ALS-WR and is ideal for your kind of input. It's quite easy to download and run, and doesn't need Hadoop.

Given that it may be directly suitable for your problem I don't mind plugging it here.
