从至少出现在N个不同集合中的集合列表中查找所有子集 [英] Find all subsets from list of sets that appears in at least N different sets

查看:159
本文介绍了从至少出现在N个不同集合中的集合列表中查找所有子集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在研究将搜索引擎中的关键字按SERP中相同的网址数量进行分组的算法。每个组代表一个url,每个值是其中出现了URL的SERP关键字的ID。

I'm working on algorithm of grouping keywords from search engine by amount of same urls they have in SERP. Each group represents an url, and each value is an id of keyword for SERP where url appeared.

我有一个列表组:

groups = [
    [1],
    [1, 2 ,3],
    [1, 2, 3, 4, 5],
    [1, 2, 3 ,4], 
    [2, 3],
    [4, 5, 6],
    [4, 5, 7]
]

我需要提取至少出现在N个组中的 全部项目集以大小的顺序减小:

I need to fetch ALL sets of items that appears at least in N groups in order of "size" decreases:

在上面的示例中,N = 3两个子集:
[1、2、3]和[4、5]

In example above for N=3 we have two subsets: [1, 2 ,3] and [4, 5]

如我所见要获取它:

迭代1:查找出现至少3次(为[1、2、3])的最大集合并删除

Iteration 1: find largest set that appears at least 3 times (it's [1, 2 ,3]) and delete from all set, where it appears.

迭代后,我们得到:

groups = [
        [1],
        [4, 5],
        [4], 
        [2, 3],
        [4, 5, 6],
        [4, 5, 7]
    ]

迭代2:查找出现至少3次的最大值(它是[4,5])

Iteration 2: find largest that appears at least 3 times (it's [4, 5])

迭代后,我们得到:

groups = [
        [1],
        [4], 
        [2, 3],
        [6],
        [7]
    ]

算法结束:因为至少没有更多的集合了分组进行3次。

End of algorithm: because there is no more sets that appears at least 3 times in groups.

您是否有算法来获取它们?

Do you have any idea for algorithm to fetch them?

N 在1到10之间。

p.s。群组清单相当大,从1000到10000个项目。数字是db中对象的ID。

p.s. groups list is quite big, from 1000 to 10000 items. Numbers is ids of objects in db.

推荐答案

这是一个 itertools 称为迭代1 的第一部分的方法。如果完全可以运行它,那么您可以循环运行它,并在每个阶段删除找到的项目。正如我在评论中指出的那样,仅对小 n 可行:

Here is an itertools approach for the first part of what you call Iteration 1. If it is feasible to run it at all, then you can run it repeatedly in a loop, removing the found items at each stage. As I indicated in the comments, it is only feasible for small n:

import itertools

def intersect(sets):
    #forms the intersection of all s in sets
    #assumes that there is at least one

    I = sets[0].copy()
    for s in sets[1:]:
        I = I & s
    return I

def largestIntersection(sets,n):
    maxSize = 0
    maxSets = [set()]
    for groupCombo in itertools.combinations(sets,n):
        I = intersect(groupCombo)
        if len(I) > maxSize:
            maxSize = len(I)
            maxSets = [I]
        elif len(I) == maxSize:
            maxSets.append(I)
    return maxSets

例如:

>>> groups = [
    [1],
    [1, 2 ,3],
    [1, 2, 3, 4, 5],
    [1, 2, 3 ,4],
    [2, 3],
    [4, 5, 6],
    [4, 5, 7]
]
>>> groups = [set(group) for group in groups]
>>> largestIntersection(groups,3)
[{1, 2, 3}]

编辑时:以下修改可能有帮助-取决于组中数字的分布和组的大小:

On The following modification might help -- depending on distribution of numbers in the groups and the size of the groups:

def guessSize(sets,n,trials):
    return max(len(intersect(random.sample(sets,n))) for trial in range(trials))

def maxIntersection(sets,n,trials = 100000):
    k = guessSize(sets,n,trials)
    filteredSets = [s for s in sets if len(s) >= k]
    return largestIntersection(filteredSets,n)

在尝试迭代相交之前,请减少组的数量。例如:

The idea is to first reduce the number of groups before you try to iterate over the intersections. For example:

#stress test:

import random
nums = list(range(1,101))
testGroups = [set(random.sample(nums,random.randint(1,100))) for n in range(1000)]
foundGroups = maxIntersection(testGroups,3)

计算 foundGroups 只需要几秒钟如果直接使用 largestIntersection(testGroups),则需要几分钟。另一方面,通过选择不同的随机参数,节省的时间可以忽略不计。

it takes only a few seconds to compute foundGroups as opposed to several minutes if I had directly used largestIntersection(testGroups). On the other hand, with different choices of the random parameters the time saving becomes negligible.

在进一步编辑中:也许您可以做得更多进取的过滤:

On Further Perhaps you can be even more aggressive with the filtering:

import collections
def maxIntersection(sets,n,trials = 100000):
    k = guessSize(sets,n,trials)
    filteredSets = [s for s in sets if len(s) >= k]
    counts = collections.Counter()
    for s in filteredSets:
        for x in s:
            counts[x] +=1
    t = {x for x,c in counts.items() if c >= k}
    filteredSets = [s & t for s in filteredSets if len(s & t) >= k]
    return largestIntersection(filteredSets,n)

这篇关于从至少出现在N个不同集合中的集合列表中查找所有子集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆