Efficiently calculate large similarity matrix


Problem description

In a project I'm currently working on there are about 200,000 users. For each of these users we have defined a similarity measure with regard to every other user. This yields a similarity matrix of 200,000 x 200,000. A tad large. A naive approach (in Ruby) of calculating each entry would take days.

What strategies can I employ to make computing the matrix fields feasible? In what data store should I put this beast?
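Since the measure is symmetric, one immediate saving is to compute each pair only once (i < j), halving the naive work. A minimal Ruby sketch of that idea, using a hypothetical Jaccard-style `similarity` function as a stand-in for the project's real measure:

```ruby
# Hypothetical similarity measure -- a stand-in for the project's real one.
# Here: Jaccard overlap of two profile keyword sets (illustrative only).
def similarity(a, b)
  (a & b).size.to_f / (a | b).size
end

profiles = {
  1 => [:ruby, :rails, :sql],
  2 => [:ruby, :python],
  3 => [:sql, :python]
}

# Symmetry: iterate over each unordered pair exactly once (i < j),
# so n*(n-1)/2 evaluations instead of n*n.
ids = profiles.keys
sims = ids.combination(2).map { |i, j|
  [[i, j], similarity(profiles[i], profiles[j])]
}.to_h
```

For 200,000 users this still means roughly 2 x 10^10 pairs, so symmetry alone is not enough; it has to be combined with the sparsity and batching ideas discussed in the answer below.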

Solution

Here are some bits and pieces of an answer; there are still too many gaps in what you've told us to permit a good answer, but you can fill those in yourself. From everything you've told us, I don't think that the major part of your task is to efficiently calculate a large similarity matrix; I think that the major parts are to efficiently retrieve values from such a matrix and to efficiently update the matrix.

As we've already determined, the matrix is sparse and symmetric; it would be useful to know how sparse. This reduces the storage requirements considerably, but we don't know by how much.
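A sparse, symmetric store can exploit both properties at once: canonicalize each pair so (a, b) and (b, a) map to one entry, and keep only similarities above some cutoff. A sketch under those assumptions (the threshold value and the class name are illustrative):

```ruby
# Sparse symmetric storage sketch. Assumptions: similarities below a
# cutoff are treated as zero, and user ids are comparable values.
class SparseSimilarity
  def initialize(threshold: 0.1)
    @threshold = threshold
    @data = {}   # canonical [min, max] id pair -> similarity
  end

  def []=(a, b, value)
    key = [a, b].minmax          # symmetry: one entry per unordered pair
    if value >= @threshold
      @data[key] = value
    else
      @data.delete(key)          # below the cutoff counts as zero
    end
  end

  def [](a, b)
    return 1.0 if a == b         # a user is fully similar to itself
    @data.fetch([a, b].minmax, 0.0)
  end

  def stored_entries
    @data.size
  end
end
```

If, say, only 0.1% of pairs clear the threshold, the 2 x 10^10 logical entries shrink to about 2 x 10^7 stored ones, which fits comfortably in an ordinary database table keyed on the canonical pair.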

You've told us a bit about updates to user profiles, but does your similarity matrix have to be updated as frequently? My expectation (another assumption) is that similarity measures do not change quickly or sharply when a user modifies his/her profile. From this I hypothesise that working with a similarity measure which is a few minutes (even a few hours) out of date won't do any serious harm.

I think that all this takes us into the domain of databases, which should support fast access to stored similarity measures of the volumes you indicate. I'd be looking to do batch updates of the measures, and only of the measures for users whose profiles have changed, at an interval to suit your demands and the availability of computer power.
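The batch-update idea can be sketched as follows: collect the ids of users whose profiles changed since the last run, and recompute only those rows. All names here are illustrative, and the Hash-backed store and Jaccard-style measure are stand-ins for the real database and metric:

```ruby
require 'set'

# Illustrative stand-in for the project's real similarity measure.
def similarity(a, b)
  (a & b).size.to_f / (a | b).size
end

# Recompute only the rows of users whose profiles changed since the
# last run; untouched rows keep their (slightly stale) values.
def batch_update!(sims, profiles, dirty_ids)
  Set.new(dirty_ids).each do |i|
    profiles.each_key do |j|
      next if i == j
      sims[[i, j].minmax] = similarity(profiles[i], profiles[j])
    end
  end
end

profiles = { 1 => [:ruby], 2 => [:ruby, :sql] }
sims = {}
batch_update!(sims, profiles, [2])   # only user 2's profile changed
```

Each dirty user costs one row (n - 1 comparisons) rather than a full-matrix rebuild, which is what makes a periodic batch job affordable.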

As for the initial creation of the first version of the similarity matrix: so what if it takes a week in the background? You're only going to do it once.

