从至少出现在N个不同集合中的集合列表中查找所有子集 [英] Find all subsets from list of sets that appears in at least N different sets
问题描述
我正在研究将搜索引擎中的关键字按SERP中相同的网址数量进行分组的算法。每个组代表一个url,每个值是其中出现了URL的SERP关键字的ID。
I'm working on algorithm of grouping keywords from search engine by amount of same urls they have in SERP. Each group represents an url, and each value is an id of keyword for SERP where url appeared.
我有一个列表组:
groups = [
[1],
[1, 2 ,3],
[1, 2, 3, 4, 5],
[1, 2, 3 ,4],
[2, 3],
[4, 5, 6],
[4, 5, 7]
]
我需要提取至少出现在N个组中的 全部项目集以大小的顺序减小:
I need to fetch ALL sets of items that appears at least in N groups in order of "size" decreases:
在上面的示例中,N = 3两个子集:
[1、2、3]和[4、5]
In example above for N=3 we have two subsets: [1, 2 ,3] and [4, 5]
如我所见要获取它:
迭代1:查找出现至少3次(为[1、2、3])的最大集合并删除
Iteration 1: find largest set that appears at least 3 times (it's [1, 2 ,3]) and delete from all set, where it appears.
迭代后,我们得到:
groups = [
[1],
[4, 5],
[4],
[2, 3],
[4, 5, 6],
[4, 5, 7]
]
迭代2:查找出现至少3次的最大值(它是[4,5])
Iteration 2: find largest that appears at least 3 times (it's [4, 5])
迭代后,我们得到:
groups = [
[1],
[4],
[2, 3],
[6],
[7]
]
算法结束:因为至少没有更多的集合了分组进行3次。
End of algorithm: because there is no more sets that appears at least 3 times in groups.
您是否有算法来获取它们?
Do you have any idea for algorithm to fetch them?
N 在1到10之间。
p.s。群组清单相当大,从1000到10000个项目。数字是db中对象的ID。
p.s. groups list is quite big, from 1000 to 10000 items. Numbers is ids of objects in db.
推荐答案
这是一个 itertools
称为迭代1
的第一部分的方法。如果完全可以运行它,那么您可以循环运行它,并在每个阶段删除找到的项目。正如我在评论中指出的那样,仅对小 n
可行:
Here is an itertools
approach for the first part of what you call Iteration 1
. If it is feasible to run it at all, then you can run it repeatedly in a loop, removing the found items at each stage. As I indicated in the comments, it is only feasible for small n
:
import itertools
def intersect(sets):
#forms the intersection of all s in sets
#assumes that there is at least one
I = sets[0].copy()
for s in sets[1:]:
I = I & s
return I
def largestIntersection(sets,n):
maxSize = 0
maxSets = [set()]
for groupCombo in itertools.combinations(sets,n):
I = intersect(groupCombo)
if len(I) > maxSize:
maxSize = len(I)
maxSets = [I]
elif len(I) == maxSize:
maxSets.append(I)
return maxSets
例如:
>>> groups = [
[1],
[1, 2 ,3],
[1, 2, 3, 4, 5],
[1, 2, 3 ,4],
[2, 3],
[4, 5, 6],
[4, 5, 7]
]
>>> groups = [set(group) for group in groups]
>>> largestIntersection(groups,3)
[{1, 2, 3}]
编辑时:以下修改可能有帮助-取决于组中数字的分布和组的大小:
On The following modification might help -- depending on distribution of numbers in the groups and the size of the groups:
def guessSize(sets,n,trials):
return max(len(intersect(random.sample(sets,n))) for trial in range(trials))
def maxIntersection(sets,n,trials = 100000):
k = guessSize(sets,n,trials)
filteredSets = [s for s in sets if len(s) >= k]
return largestIntersection(filteredSets,n)
在尝试迭代相交之前,请减少组的数量。例如:
The idea is to first reduce the number of groups before you try to iterate over the intersections. For example:
#stress test:
import random
nums = list(range(1,101))
testGroups = [set(random.sample(nums,random.randint(1,100))) for n in range(1000)]
foundGroups = maxIntersection(testGroups,3)
计算 foundGroups
只需要几秒钟如果直接使用 largestIntersection(testGroups)
,则需要几分钟。另一方面,通过选择不同的随机参数,节省的时间可以忽略不计。
it takes only a few seconds to compute foundGroups
as opposed to several minutes if I had directly used largestIntersection(testGroups)
. On the other hand, with different choices of the random parameters the time saving becomes negligible.
在进一步编辑中:也许您可以做得更多进取的过滤:
On Further Perhaps you can be even more aggressive with the filtering:
import collections
def maxIntersection(sets,n,trials = 100000):
k = guessSize(sets,n,trials)
filteredSets = [s for s in sets if len(s) >= k]
counts = collections.Counter()
for s in filteredSets:
for x in s:
counts[x] +=1
t = {x for x,c in counts.items() if c >= k}
filteredSets = [s & t for s in filteredSets if len(s & t) >= k]
return largestIntersection(filteredSets,n)
这篇关于从至少出现在N个不同集合中的集合列表中查找所有子集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!