FAST unique combinations (from list with duplicates) WITHOUT LOOKUPS


Problem description

It seems that although there are plenty of algorithms and functions available online for generating unique combinations of any size from a list of unique items, none is available for a list of non-unique items (i.e. a list containing repetitions of the same value).

The question is: how can a generator function generate, ON-THE-FLY, all the unique combinations from a non-unique list without the computationally expensive need of filtering out duplicates?

Now that there is a bounty-motivated answer to the question, it is easier to provide more clarity about what I expect to achieve:

First, here is code illustrating how to check whether a combination comboB is considered to be a duplicate of another one (comboA):

comboA = [1, 2, 2]
comboB = [2, 1, 2]
# sorted() returns a new sorted list; note that list.sort() sorts in place
# and returns None, so comparing the results of .sort() calls would always
# be True regardless of the lists' contents.
print("B is a duplicate of A:", sorted(comboA) == sorted(comboB))

In the given example B is a duplicate of A, and print() prints True.

The problem of getting a generator function capable of providing unique combinations on-the-fly, in the case of a non-unique list, has been solved here: Getting unique combinations from a non-unique list of items, FASTER? However, the generator function provided there needs lookups and requires memory, which causes problems in the case of a huge amount of combinations.
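For contrast, the lookup-based approach referenced above can be sketched roughly as follows (an illustrative reimplementation for this discussion, not the linked answer's exact code):

```python
from itertools import combinations

def unique_combinations_with_lookups(lst, k):
    """Yield unique k-combinations by filtering duplicates with a set.
    The `seen` set grows with the number of unique combinations, which
    is exactly the memory cost the question wants to avoid."""
    seen = set()
    for combo in combinations(sorted(lst), k):
        if combo not in seen:
            seen.add(combo)
            yield combo
```

Sorting the input first makes duplicate combinations arrive as identical tuples, so a plain set membership test suffices, but the set can grow as large as the full output.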

The function provided in the current version of that answer does the job without any lookups and appears to be the right answer here, BUT ...

The goal behind getting rid of lookups is to speed up the generation of unique combinations in the case of a list with duplicates.

When writing the first version of this question, I wrongly assumed that code which doesn't require creating the lookup set needed to ensure uniqueness would have an advantage over code needing lookups. That is not the case, or at least not always. The code in the answer provided so far does not use lookups, but takes much more time to generate all the combinations when the list has no redundant items, or only a few.

Here are some timings to illustrate the current situation:

-----------------
 k: 6 len(ls): 48
Combos   Used Code                               Time
---------------------------------------------------------
12271512 len(list(combinations(ls,k)))       :  2.036 seconds
12271512 len(list(subbags(ls,k)))            : 50.540 seconds
12271512 len(list(uniqueCombinations(ls,k))) :  8.174 seconds
12271512 len(set(combinations(sorted(ls),k))):  7.233 seconds
---------------------------------------------------------
12271512 len(list(combinations(ls,k)))       :  2.030 seconds
       1 len(list(subbags(ls,k)))            :  0.001 seconds
       1 len(list(uniqueCombinations(ls,k))) :  3.619 seconds
       1 len(set(combinations(sorted(ls),k))):  2.592 seconds

The timings above illustrate the two extremes: no duplicates, and only duplicates. All other timings fall between these two.
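Timings of this kind can be reproduced with a small harness along the following lines (the exact `ls` used for the table above is not shown in the question, so the small list below is a stand-in assumption):

```python
from itertools import combinations
from timeit import default_timer as timer

# Stand-in input (assumption): 4 distinct values, each repeated 3 times.
ls = list(range(4)) * 3
k = 3

start = timer()
# The baseline approach from the table: materialize all combinations of
# the sorted list and let a set discard the duplicates.
count = len(set(combinations(sorted(ls), k)))
elapsed = timer() - start
print(count, f"len(set(combinations(sorted(ls),k))): {elapsed:.3f} seconds")
```

For a serious benchmark the list sizes from the table (len(ls) = 48, k = 6) would be used instead, and each measurement repeated several times.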

My interpretation of the results above is that a pure Python function (no itertools or other C-compiled modules) can be significantly faster, but it can also be much slower, depending on how many duplicates are in the list. So there is probably no way around writing C++ code for a Python .so extension module providing the required functionality.

Answer

Instead of post-processing/filtering your output, you can pre-process your input list. This way, you avoid generating duplicates in the first place. Pre-processing involves either sorting the input or using a collections.Counter on it. One possible recursive realization is:

def subbags(bag, k):
    a = sorted(bag)
    n = len(a)
    sub = []

    def index_of_next_unique_item(i):
        # a is sorted, so duplicates of a[i] are contiguous;
        # return the index of the first item differing from a[i].
        j = i + 1

        while j < n and a[j] == a[i]:
            j += 1

        return j

    def combinate(i):
        if len(sub) == k:
            yield tuple(sub)
        elif n - i >= k - len(sub):  # enough items left to complete sub
            # Branch 1: include a[i] and continue with the next item.
            sub.append(a[i])
            yield from combinate(i + 1)
            sub.pop()
            # Branch 2: skip a[i] AND all its duplicates, so each
            # multiset of values is produced exactly once.
            yield from combinate(index_of_next_unique_item(i))

    yield from combinate(0)

bag = [1, 2, 3, 1, 2, 1]
k = 3
i = -1  # so the count printed below is 0 if the generator yields nothing

print(sorted(bag), k)
print('---')

for i, subbag in enumerate(subbags(bag, k)):
    print(subbag)

print('---')
print(i + 1)

Output:

[1, 1, 1, 2, 2, 3] 3
---
(1, 1, 1)
(1, 1, 2)
(1, 1, 3)
(1, 2, 2)
(1, 2, 3)
(2, 2, 3)
---
6

This requires some stack space for the recursion, but, together with sorting the input, it should use substantially less time and memory than generating and discarding repeats.
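The answer also mentions collections.Counter as an alternative pre-processing step. That variant can be sketched as follows (an illustrative sketch, not code from the answer): instead of skipping over runs of equal items in a sorted list, it iterates over distinct values and decides how many copies of each to take.

```python
from collections import Counter

def unique_combinations(bag, k):
    """Yield each unique k-combination of `bag` exactly once, driven by
    item multiplicities from collections.Counter instead of a sorted list."""
    items = sorted(Counter(bag).items())  # [(value, multiplicity), ...]

    def rec(idx, remaining, chosen):
        if remaining == 0:
            yield tuple(chosen)
            return
        if idx == len(items):
            return  # ran out of distinct values before filling the combo
        value, mult = items[idx]
        # Take between min(mult, remaining) and 0 copies of `value`,
        # descending so output comes in sorted order.
        for take in range(min(mult, remaining), -1, -1):
            yield from rec(idx + 1, remaining - take, chosen + [value] * take)

    yield from rec(0, k, [])
```

On the example input above, `list(unique_combinations([1, 2, 3, 1, 2, 1], 3))` produces the same six tuples as `subbags`.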
