Build an undirected weighted graph by matching N vertices


Problem:
I want to suggest the top 10 most compatible matches for a particular user, by comparing his/her 'interests' with interests of all others. I'm building an undirected weighted graph between users, where the weight = match score between the two users.

I already have a set of N users: S. For any user U in S, I have a set of interests I. After a long time (a week?) I create a new user U with a set of interests and add it to S. To generate a graph for this new user, I'm comparing interest set I of the new user with the interest sets of all the users in S, iteratively. The problem is with this "all the users" part.

Let's talk about the function for comparing interests. An interest in a set of interests I is a string. I'm comparing two strings/interests using WikipediaMiner (it uses Wikipedia links to infer how closely related two strings are, e.g. Billy Jean & Thriller ==> high match, Brad Pitt & Jamaica ==> low match, and so on). I've asked a question about this too (to see if there's a better solution than the one I'm currently using).

So, the above function takes non-negligible time, and in total it will take a huge amount of time when we compare thousands (maybe millions?) of users and their hundreds of interests each. For 100,000 users, I can't afford to make 100,000 user comparisons this way in a short time (< 30 sec). But I have to give the top 10 recommendations within 30 seconds, possibly a preliminary recommendation that is then improved over the next minute or so. Simply comparing 1 user against all N users sequentially is too slow.

Question:
Please suggest an algorithm, method or tool using which I can improve my situation or solve my problem.

Solution

I could only think of an approach to the problem, since the outcome of the steps below depends on the nature of the inter-relations between the interests.

=>step:1 As your title says, build an undirected weighted graph with the interests as vertices and the weighted match between them as edges.
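A minimal sketch of step 1, assuming Python with networkx and a placeholder wikipedia_miner_similarity() wrapper (both the library choice and the wrapper name are my assumptions, not part of the original answer):

```python
# Sketch only: networkx and the similarity wrapper are assumptions.
from itertools import combinations
import networkx as nx

def wikipedia_miner_similarity(a: str, b: str) -> float:
    """Hypothetical wrapper around the WikipediaMiner relatedness call, returning a score in [0, 1]."""
    raise NotImplementedError("plug in your WikipediaMiner call here")

def build_interest_graph(interests, min_score=0.1):
    """Interests are vertices; the weighted match between two interests is the edge weight."""
    g = nx.Graph()
    g.add_nodes_from(interests)
    for a, b in combinations(interests, 2):
        score = wikipedia_miner_similarity(a, b)
        if score >= min_score:  # drop near-zero edges to keep the graph sparse
            g.add_edge(a, b, weight=score)
    return g
```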

=>step:2 - cluster the interests. (Most complex)

K-means is a commonly used clustering algorithm, but it works on a K-dimensional vector space (refer to the wiki to see how K-means works): it minimizes, over all clusters, the sum of squared distances between each point and the center of its cluster. In your case there are no dimensions available, so see if you can apply the same minimization logic by creating some kind of rule for the distance between two vertices: higher match => smaller distance, and vice versa (what are the different matching levels provided by wiki-miner?). Choose the "mean" of a cluster as, say, the most connected vertex in the chosen set; PageRank sounds like a good option for figuring out the most connected vertex.

"Pair-counting F-Measure" sounds like it suits your need (weighted graph); check the other options available.

(Note: keep modifying this step until the right clustering algorithm is found and the right calibration for the distance rule, number of clusters, etc. is found.)
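Since K-means itself does not apply without a vector space, here is one hedged sketch of the idea: turn match scores into distances (distance = 1 - score), run hierarchical clustering on a precomputed distance matrix as a stand-in for K-means, and use PageRank on each cluster's subgraph to pick its "mean" (most connected vertex). The libraries (numpy, scipy, networkx) and the cut-offs are my assumptions.

```python
# Sketch: distance rule + PageRank-based cluster "mean"; numpy/scipy/networkx assumed.
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_interests(g: nx.Graph, k: int = 10):
    """Hierarchical clustering on distance = 1 - match score (missing edges get maximum distance)."""
    nodes = list(g.nodes)
    n = len(nodes)
    dist = np.ones((n, n))
    np.fill_diagonal(dist, 0.0)
    for i in range(n):
        for j in range(i + 1, n):
            if g.has_edge(nodes[i], nodes[j]):
                d = 1.0 - g[nodes[i]][nodes[j]]["weight"]  # higher match => smaller distance
                dist[i, j] = dist[j, i] = d
    labels = fcluster(linkage(squareform(dist), method="average"), t=k, criterion="maxclust")
    clusters = {}
    for node, label in zip(nodes, labels):
        clusters.setdefault(label, []).append(node)
    return clusters

def cluster_center(g: nx.Graph, members):
    """The cluster's 'mean': the most connected vertex, picked with PageRank."""
    pr = nx.pagerank(g.subgraph(members), weight="weight")
    return max(pr, key=pr.get)
```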

=>Step:3 - Evaluate the clusters

From here on it's a matter of calibrating a few things to fit your need. Examine the clusters and re-evaluate: the number of clusters, the inter-cluster distance, the distance between vertices inside a cluster, the size of the clusters, and the time/precision trade-off (compare the final match results against results obtained without any clustering). Go to step 2 until this evaluation is satisfactory.
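One way to make this calibration loop concrete (a sketch only; scikit-learn's silhouette score is my choice, not part of the answer): score each candidate cluster count on the same precomputed distance matrix and keep the best one.

```python
# Sketch: evaluate candidate clusterings on the precomputed distance matrix.
# scikit-learn's silhouette score is an assumption; any cluster-quality metric works.
from sklearn.metrics import silhouette_score

def pick_cluster_count(dist, label_fn, k_candidates=range(5, 55, 5)):
    """label_fn(k) must return one cluster label per vertex (e.g. wrap the step-2 sketch)."""
    best_k, best_score = None, -1.0
    for k in k_candidates:
        labels = label_fn(k)
        score = silhouette_score(dist, labels, metric="precomputed")
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```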

=>step:4 - Examine the new interest

Iterate through all the clusters, calculate the new interest's connectivity to each cluster, sort the clusters by connectivity, and for the top x% of sorted clusters, sort and filter out the highly connected interests.
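A possible sketch of this step, reusing the hypothetical helpers from the earlier snippets (networkx PageRank picks representative vertices; the similarity callable is the WikipediaMiner wrapper): compare the new interest only against each cluster's most connected members, then keep the top fraction of clusters.

```python
# Sketch: rank clusters for a new interest via their most connected members.
import networkx as nx

def rank_clusters_for_interest(new_interest, clusters, g, similarity, top_fraction=0.2):
    """clusters: {label: member interests}; g: the interest graph; similarity: e.g. the WikipediaMiner wrapper."""
    scored = []
    for label, members in clusters.items():
        pr = nx.pagerank(g.subgraph(members), weight="weight")
        reps = sorted(pr, key=pr.get, reverse=True)[:max(1, len(members) // 5)]
        connectivity = sum(similarity(new_interest, r) for r in reps) / len(reps)
        scored.append((connectivity, label))
    scored.sort(reverse=True)
    keep = max(1, int(len(scored) * top_fraction))
    return [label for _, label in scored[:keep]]
```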

=>step:5 - Match User

Reverse-look-up the set of all users holding the interests obtained from step 4, compare all interests of both users, and generate a score.
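As one hedged sketch of this step: an inverted index from interest to users makes the reverse lookup cheap, and only the users found that way get scored. The index layout and the scoring formula below are my assumptions.

```python
# Sketch: inverted index for the reverse lookup, plus one possible match score.
from collections import defaultdict

def build_inverted_index(users_to_interests):
    """users_to_interests: {user_id: iterable of interest strings}."""
    index = defaultdict(set)
    for user, interests in users_to_interests.items():
        for interest in interests:
            index[interest].add(user)
    return index

def candidate_users(index, shortlisted_interests):
    """Union of all users holding any of the interests picked in step 4."""
    users = set()
    for interest in shortlisted_interests:
        users |= index.get(interest, set())
    return users

def match_score(interests_a, interests_b, similarity):
    """Average best-match similarity of A's interests against B's; just one option."""
    return sum(max(similarity(a, b) for b in interests_b) for a in interests_a) / len(interests_a)
```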

=>step:6 - Apart from the above, you can distribute the load (multiple machines can be used for the clusters, machine-n for cluster-n) across multiple systems/processors, based on the traffic and so on.
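On a single box the per-cluster work already parallelizes naturally; a minimal sketch with Python's multiprocessing (one worker per cluster is my simplification of the "machine-n for cluster-n" idea):

```python
# Sketch: fan the per-cluster scoring out to worker processes.
from multiprocessing import Pool

def score_cluster(args):
    new_interest, members = args
    # ... compare new_interest against this cluster's members and return a score ...
    return 0.0  # placeholder

def score_all_clusters(new_interest, clusters, workers=4):
    jobs = [(new_interest, members) for members in clusters.values()]
    with Pool(processes=workers) as pool:
        return pool.map(score_cluster, jobs)
```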

What is the application for this problem, and what is the expected traffic?


Another solution, to find the connectivity between the new interest and the "set of interests in cluster" C: Wiki-Miner runs on a set of wiki documents; let me call it the UNIVERSE.

1: For each cluster, fetch and maintain (an index; Lucene might be handy) the "set of highly relevant docs" (I am calling it the HRDC) out of the UNIVERSE. So you have 'N' HRDCs if you have 'N' clusters.

2: When a new interest comes in, for each HRDC find "Connectivity with Cluster" = "hit ratio of the interest in the HRDC" / "hit ratio of the interest in the UNIVERSE" (see the sketch after step 4 below).

3: Sort the "Connectivity with Cluster" values and choose the highly connected clusters.

4: Either compare all the vertices in the cluster with the new interest, or only the highly connected vertices (using PageRank), depending on the time/precision trade-off that suits you.
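A short sketch of the hit-ratio connectivity from point 2. count_hits() is a stand-in for whatever index you query (e.g. Lucene); it is not a real API, just an assumed callable.

```python
# Sketch: "Connectivity with Cluster" = hit ratio in the HRDC / hit ratio in the UNIVERSE.
def connectivity_with_cluster(hrdc_hits, hrdc_size, universe_hits, universe_size):
    hrdc_ratio = hrdc_hits / hrdc_size
    universe_ratio = universe_hits / universe_size
    return hrdc_ratio / universe_ratio if universe_ratio else 0.0

def rank_clusters(interest, hrdcs, universe_docs, count_hits, top_n=3):
    """hrdcs: {cluster_label: list of doc ids}; count_hits(interest, docs) -> number of hits."""
    universe_hits = count_hits(interest, universe_docs)
    scores = {
        label: connectivity_with_cluster(count_hits(interest, docs), len(docs),
                                         universe_hits, len(universe_docs))
        for label, docs in hrdcs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```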
