Utilizing multiple, weighted data models for a Mahout recommender
Question
I have a boolean preference recommender based on user similarity. My data set essentially contains relations in which each ItemId is an article the user has decided to read. I'd like to add a second data model in which each ItemId is a subscription to a particular topic.
The only way I can imagine doing this is by merging the two together, offsetting the subscription IDs so that they don't collide with the article IDs. For weighting I considered dropping the boolean preference setup and introducing preference scores, where the articles subset has a preference score of 1 (for example) and the subscriptions subset has a preference score of 2.
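The merge described above could be sketched roughly as follows. This is only an illustration of the plan in the question, not Mahout code: the function names, the offset value, and the weights 1 and 2 are assumptions chosen for the example. The output lines use the comma-separated `userID,itemID,value` layout that Mahout's file-based data model reads.

```python
from typing import Iterable, List, Tuple

# Hypothetical offset; it must be larger than every article ID in the data.
SUBSCRIPTION_ID_OFFSET = 1_000_000_000


def merge_models(
    reads: Iterable[Tuple[int, int]],
    subscriptions: Iterable[Tuple[int, int]],
    read_weight: float = 1.0,
    subscription_weight: float = 2.0,
    offset: int = SUBSCRIPTION_ID_OFFSET,
) -> List[Tuple[int, int, float]]:
    """Combine (user, article) and (user, topic) pairs into weighted triples."""
    merged: List[Tuple[int, int, float]] = []
    for user_id, article_id in reads:
        if article_id >= offset:
            # An article ID this large could collide with a shifted topic ID.
            raise ValueError(f"article ID {article_id} is not below offset {offset}")
        merged.append((user_id, article_id, read_weight))
    for user_id, topic_id in subscriptions:
        # Shift topic IDs into their own range so they never equal an article ID.
        merged.append((user_id, topic_id + offset, subscription_weight))
    return merged


def to_csv_lines(triples: Iterable[Tuple[int, int, float]]) -> List[str]:
    """Format triples as userID,itemID,value lines."""
    return [f"{user},{item},{value}" for user, item, value in triples]


merged = merge_models([(1, 10), (2, 11)], [(1, 10)], offset=1000)
csv_lines = to_csv_lines(merged)
```

Note that article 10 and topic 10 end up as distinct items (10 and 1010) after the shift, which is the whole point of the offset.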
I'm not sure if this would work, however, because the preference score isn't exactly analogous to the sort of weighting I'm after; they probably include some concept of lower scores representing dissatisfaction.
I have to imagine there's a better way to do this or at least that there are tweaks to my plan which would make it work more along the lines I desire.
Recommended answer
I think you're thinking of it in the right way. Yes, you want a bit more expressiveness than a simple exists/doesn't exist for subscriptions and articles, since they mean somewhat different things. I would suggest picking weights that reflect their relative frequency. For example, if users have read 100K articles over all time and made 10,000 subscriptions, then you might pick a subscription weight of "10" and a read weight of "1".
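One way to turn that rule of thumb into numbers is to weight each event type by how rare it is compared with the most common type. The sketch below is an assumption about how to compute it, using the counts from the example; the function name is made up for illustration.

```python
from typing import Dict


def frequency_weights(event_counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each event type by (count of most common type) / (its own count)."""
    if any(count <= 0 for count in event_counts.values()):
        raise ValueError("every event type needs a positive count")
    most_common = max(event_counts.values())
    return {name: most_common / count for name, count in event_counts.items()}


# Counts taken from the example: 100K reads, 10,000 subscriptions.
weights = frequency_weights({"read": 100_000, "subscription": 10_000})
print(weights)  # {'read': 1.0, 'subscription': 10.0}
```

The rarer event (a subscription) gets the larger weight, and the most common event always ends up at 1.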
This doesn't quite work if you treat those values as preference scores, for a number of reasons. It works better if you use an approach that treats them like what they are, which are linear weights.
I would point you to the ALS-WR algorithm, which is specifically designed for this type of input. For example: Collaborative Filtering for Implicit Feedback Datasets
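To make "linear weights" concrete, here is the formulation from that paper as I understand it. The raw value $r_{ui}$ (here, 1 for a read or 10 for a subscription) is not treated as a rating to predict. It is split into a binary preference $p_{ui}$ and a confidence $c_{ui}$ that grows linearly with the weight; $\alpha$ and $\lambda$ are tuning constants, and $x_u$, $y_i$ are the learned user and item factor vectors.

```latex
p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0 \\
0 & \text{if } r_{ui} = 0
\end{cases}
\qquad
c_{ui} = 1 + \alpha \, r_{ui}

\min_{x_*,\, y_*} \;
\sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
+ \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

Under this reading, a subscription and a read both count as "the user likes this", but the model is penalized more for getting the subscription wrong. A low weight never signals dissatisfaction, only weaker evidence.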
This is implemented in Mahout as ParallelALSFactorizationJob on Hadoop. It works nicely, though it requires Hadoop. (I can't take credit for that, though I did write most of the recommender code in Mahout.)
Advertisement: I'm working on commercializing a "next generation" system, evolved from my work in Mahout, as Myrrix. It is an implementation of ALS-WR and is ideal for your kind of input. It's quite easy to download and run, and doesn't need Hadoop.
Given that it may be directly suitable for your problem I don't mind plugging it here.