How to sample from Cartesian product without repetition

Question

I have a list of sets, and I want to draw n distinct samples, each containing one item from each set. I don't want them in order, because then, for example, all the samples would necessarily have the same item from the first set. I also don't want to build the whole Cartesian product, since that may not be feasible in terms of efficiency. Any idea how to do this? Or even something that approximates this behaviour?

An example that does not work:

(prod for i, prod in zip(range(n), itertools.product(*list_of_sets)))
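To make the problem concrete, here is a minimal sketch with made-up lists (`colors`, `sizes` and `shapes` are illustrative names, not from the question). `itertools.product` varies the last factor fastest, so taking only the first n results (here with `itertools.islice`, which has the same effect as the `zip(range(n), ...)` trick) returns samples that all share the item from the first factor.

```python
import itertools

# Hypothetical inputs for illustration; lists keep the order predictable.
colors = ["red", "green", "blue"]
sizes = ["S", "M", "L"]
shapes = ["circle", "square"]
n = 4

# Take the first n tuples of the product, in its standard order.
first_n = list(itertools.islice(itertools.product(colors, sizes, shapes), n))

# Collect the distinct items that came from the first factor.
first_items = {sample[0] for sample in first_n}

print(first_n)
print(first_items)  # {'red'}
```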

Recommended answer

All the solutions above waste a lot of resources filtering out repeated results as the iteration nears its end. That's why I came up with a method that never generates a duplicate, so nothing has to be filtered out, from the start until the very end.

The idea is: give (only in your head) each result of the standard-order Cartesian product an index. For example, for A x B x C with 2000 x 1 x 2 = 4000 elements:

0: (A[0], B[0], C[0])
1: (A[1], B[0], C[0])
...
1999: (A[1999], B[0], C[0])
2000: (A[0], B[0], C[1])
...
3999: (A[1999], B[0], C[1])
done.

That leaves a few open questions:

  • How do I get a list of possible indices? Answer: Just multiply 2000*1*2=4000; every non-negative integer below that is a valid index.
  • How do I generate random indices sequentially without repetition? There are two answers: If you want samples with a known sample size n, just use random.sample(range(number_of_indices), n) (xrange in Python 2). But if you don't know the sample size yet (the more general case), you have to generate indices on the fly so as not to waste memory. In that case, generate index = random.randint(0, k - 1) with k = number_of_indices to get the first index, and with k = number_of_indices - n once n results have already been produced. Just check my code below (be aware that I use a singly linked list there to store the used indices. It makes each insertion O(1) once the position is found, and we need a lot of insertions here; finding the position still means walking the list).
  • How do I generate the output from the index? Answer: Say our index is i. Then i % 2000 is the index into A for the result. Now i // 2000 can be treated recursively as the index into the Cartesian product of the remaining factors.
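For the known-sample-size case, the three points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the answer's own code: `decode_index` and `sample_product` are hypothetical helper names, and the factors `A`, `B`, `C` are stand-ins with the 2000 x 1 x 2 sizes used in the listing above.

```python
import random

def decode_index(index, factors):
    """Turn one flat index into one tuple of the Cartesian product."""
    items = []
    for factor in factors:
        # The remainder selects the item in this factor ...
        items.append(factor[index % len(factor)])
        # ... and the quotient is the index into the remaining factors.
        index //= len(factor)
    return tuple(items)

def sample_product(factors, n):
    """Return n distinct tuples, one item per factor (hypothetical helper)."""
    # Sets are not indexable, so fix an order for each factor first.
    factors = [tuple(factor) for factor in factors]
    total = 1
    for factor in factors:
        total *= len(factor)
    # Distinct flat indices give distinct tuples.
    return [decode_index(i, factors) for i in random.sample(range(total), n)]

# Stand-in factors with the sizes from the listing: 2000 x 1 x 2.
A = list(range(2000))
B = ["b"]
C = ["c0", "c1"]

print(decode_index(2000, (A, B, C)))  # (0, 'b', 'c1'), i.e. (A[0], B[0], C[1])
print(sample_product([{1, 2, 3}, {"x", "y"}, {"p", "q"}], 5))
```

`random.sample` raises a `ValueError` if n is larger than the number of indices, and it needs the length of the range, so this sketch is probably only suitable when the total number of combinations fits in a machine-sized integer.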

Here is the code I came up with:

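import functools
import random

def random_order_cartesian_product(*factors):
    # Sets are not indexable, so turn every factor into a tuple first.
    factors = [tuple(factor) for factor in factors]
    amount = functools.reduce(lambda prod, factor: prod * len(factor), factors, 1)
    # Sorted singly linked list of used indices; each node is [index, next_node].
    # The head node is a sentinel that holds no index.
    index_linked_list = [None, None]
    for max_index in reversed(range(amount)):
        # Draw a rank among the indices that are still unused.
        index = random.randint(0, max_index)
        index_link = index_linked_list
        # Step past every used index <= the candidate, shifting it up by one each time.
        while index_link[1] is not None and index_link[1][0] <= index:
            index += 1
            index_link = index_link[1]
        # Record the index as used, keeping the list sorted.
        index_link[1] = [index, index_link[1]]
        # Decode the flat index into one item per factor.
        items = []
        for factor in factors:
            items.append(factor[index % len(factor)])
            index //= len(factor)
        yield items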
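If walking the list of used indices becomes the bottleneck for long runs, one possible alternative for the on-the-fly case is a lazy Fisher-Yates shuffle that stores only the swapped positions in a dict. This is a different technique from the linked list above, offered as a sketch only; `lazy_shuffled_indices` is a hypothetical name.

```python
import random

def lazy_shuffled_indices(total):
    """Yield 0 .. total-1 in random order without building the full list."""
    # Maps a position to the value currently stored there, but only for
    # positions whose value differs from the position itself.
    swapped = {}
    for remaining in range(total, 0, -1):
        pick = random.randrange(remaining)  # random live position
        last = remaining - 1                # last live position
        value = swapped.get(pick, pick)
        # Move the value from the last live position into the picked slot,
        # then drop the last position, which is no longer live.
        swapped[pick] = swapped.get(last, last)
        swapped.pop(last, None)
        yield value

# Every index appears exactly once, in random order.
print(list(lazy_shuffled_indices(10)))
```

Each yielded index can then be decoded with the same `index % len(factor)` and `index //= len(factor)` steps as in the code above. Memory grows with the number of indices drawn so far, not with the size of the whole product.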
