寻找高交集的最快算法 [英] Quickest algorithm for finding sets with high intersection

查看:23
本文介绍了寻找高交集的最快算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有大量的用户 ID(整数),可能数百万.这些用户都属于不同的组(整数组),因此大约有 1000 万个组.

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups.

为了简化我的示例并了解其本质,我们假设所有组都包含 20 个用户 ID.

To simplify my example and get to the essence of it, let's assume that all groups contain 20 user IDs.

我想找到所有具有 15 或更大交集的整数集对.

I want to find all pairs of integer sets that have an intersection of 15 or greater.

我应该比较每一对集合吗?(如果我保留一个将用户 ID 映射到设置成员资格的数据结构,则没有必要这样做.)最快的方法是什么?也就是说,我的底层数据结构应该是什么来表示整数集?已排序的集合,未排序的——散列能以某种方式帮助吗?我应该使用什么算法来计算集合交集)?我更喜欢与 C/C++(尤其是 STL)相关的答案,但也欢迎任何更一般的算法见解.

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.) What is the quickest way to do this? That is, what should my underlying data structure be for representing the integer sets? Sorted sets, unsorted---can hashing somehow help? And what algorithm should I use to compute set intersection)? I prefer answers that relate C/C++ (especially STL), but also any more general, algorithmic insights are welcome.

更新另外,请注意,我将在共享内存环境中并行运行它,因此首选可以干净地扩展到并行解决方案的想法.

Update Also, note that I will be running this in parallel in a shared memory environment, so ideas that cleanly extend to a parallel solution are preferred.

另外,请注意,绝大多数集合对的交集大小为 0——这意味着使用将用户 ID 映射到集合以避免计算每对的交集的数据结构可能是有利的套数.

Also, note that the vast majority of set-pairs will have an intersection size of 0---meaning that it might be advantageous to use a data-structure that mapped user IDs to sets to avoid calculating the intersection of every pair of sets.

推荐答案

我会按照你的建议去做:将用户映射到他们的组.也就是说,我会为每个用户保留一个组 ID 列表.然后我会使用以下算法:

I would do exactly what you propose: map users to their group. That is, I would keep a list of group ids for every user. Then I would use the following algorithm:

foreach group:
  map = new Map<Group, int>  // maps groups to count
  foreach user in group:
    foreach userGroup in user.groups:
      map[userGroup]++
      if( map[userGroup] == 15 && userGroup.id > group.id )
        largeIntersection( group, userGroup )

假设您有 G 组,每个组平均包含 U 用户,并且假设这些用户平均属于 g 组,那么这个将在 O( G*U*g ) 中运行.考虑到您的问题,这可能比在 O(G*G*U) 中运行的组的天真成对比较要快得多.

Given you have G groups each containing U users on average, and given that these users belong to g groups on average, then this will run in O( G*U*g ). Which, given your problem, is probably much faster than the naive pairwise comparison of groups which runs in O(G*G*U).

这篇关于寻找高交集的最快算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆