How to sample from Cartesian product without repetition
Question
I have a list of sets, and I want to draw n distinct samples, each containing one item from each set. I do not want the samples to come out in order, which would, for example, give me samples that all share the same item from the first set. I also don't want to build the whole Cartesian product, since that might not be feasible in terms of efficiency. Any idea how to do this? Or even something that approximates this behaviour?
Example that does not work:
(prod for i, prod in zip(range(n), itertools.product(*list_of_sets)))
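To see why this falls short, here is a small sketch with made-up inputs (the contents of `list_of_sets` and the value of `n` are illustrative assumptions, and lists are used so the order is deterministic). `itertools.product` varies the last factor fastest, so the first `n` tuples all share the item from the first set:

```python
import itertools

# Hypothetical input; lists are used so the iteration order is fixed.
list_of_sets = [["a", "b", "c"], [1, 2], ["x", "y"]]
n = 4

# Same idea as the generator expression above: take the first n products.
first_n = list(itertools.islice(itertools.product(*list_of_sets), n))
print(first_n)

# Collect the first component of every sample: only one distinct value.
first_items = {sample[0] for sample in first_n}
print(first_items)
```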
Recommended answer
All the above solutions waste a lot of resources filtering out repeated results toward the end of the iteration. That's why I have thought of a method that has (almost) linear speed from the start until the very end.
The idea is: give (only in your head) each result of the standard-order Cartesian product an index. For example, for A x B x C with 2000 x 1 x 2 = 4000 elements, that would be:
0: (A[0], B[0], C[0])
1: (A[1], B[0], C[0])
...
1999: (A[1999], B[0], C[0])
2000: (A[0], B[0], C[1])
...
3999: (A[1999], B[0], C[1])
done.
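As a rough illustration of this numbering, the sketch below lists a tiny product in the same order (first factor varying fastest). The factors `A`, `B`, `C` and the helper `indexed_product` are hypothetical names, and the full table is built here only to show the ordering; the point of the approach is that this table never has to exist.

```python
import itertools

# Hypothetical small factors standing in for A, B and C (sizes 3 x 1 x 2).
A = ["a0", "a1", "a2"]
B = ["b0"]
C = ["c0", "c1"]

def indexed_product(*factors):
    """List the Cartesian product with the first factor varying fastest."""
    # itertools.product varies the LAST factor fastest, so reverse the
    # factors going in and reverse each tuple coming out.
    reversed_tuples = itertools.product(*reversed(factors))
    return [tuple(reversed(t)) for t in reversed_tuples]

table = indexed_product(A, B, C)
for index, combo in enumerate(table):
    print(index, combo)
```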
So there are still some open questions:
- How do I get a list of possible indices? Answer: Just multiply 2000*1*2=4000, and every non-negative integer below that is a valid index.
- How do I generate random indices sequentially without repetition? There are two answers. If you want samples with a known sample size n, just use random.sample(range(number_of_indices), n). But if you don't know the sample size yet (the more general case), you have to generate indices on the fly so as not to waste memory. In that case, generate index = random.randint(0, k - 1) with k = number_of_indices for the first index and k = number_of_indices - n once n results have been produced, then shift the index past the indices that are already taken. Just check my code below (be aware that I use a singly linked list there to store the indices already used; once the position is found, inserting is an O(1) operation, and we need a lot of insertions here).
- How do I generate the output from the index? Answer: Well, say our index is i. Then i % 2000 is the index into A for the result, and i // 2000 can be treated recursively as the index into the Cartesian product of the remaining factors.
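For the case where the sample size is known up front, the three points above can be sketched roughly as follows. The helper names `index_to_items` and `sample_product` and the example `factors` are assumptions made for illustration, and `math.prod` requires Python 3.8 or later.

```python
import math
import random

def index_to_items(index, factors):
    """Decode one index into one item per factor (first factor fastest)."""
    items = []
    for factor in factors:
        items.append(factor[index % len(factor)])
        index //= len(factor)
    return tuple(items)

def sample_product(factors, n):
    """Return n distinct elements of the Cartesian product, in random order."""
    number_of_indices = math.prod(len(factor) for factor in factors)
    # random.sample draws without replacement and does not materialize
    # the range, so this stays cheap even for very large products.
    indices = random.sample(range(number_of_indices), n)
    return [index_to_items(i, factors) for i in indices]

# Hypothetical factors; they must be indexable sequences, not sets.
factors = [list(range(2000)), ["only"], ["left", "right"]]
samples = sample_product(factors, 5)
print(samples)
```

Note that `random.sample` raises `ValueError` if `n` exceeds the number of indices, which is the natural limit on how many distinct samples exist.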
Here is the code I came up with:
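import functools
import random

def random_order_cartesian_product(*factors):
    # Each factor must be an indexable sequence (e.g. a list), not a set.
    # Total number of elements in the Cartesian product.
    amount = functools.reduce(lambda prod, factor: prod * len(factor), factors, 1)
    # Sorted singly linked list of used indices; each node is [index, next].
    index_linked_list = [None, None]
    for max_index in reversed(range(amount)):
        # Rank of the next index among those not used yet.
        index = random.randint(0, max_index)
        index_link = index_linked_list
        # Skip past used indices to turn the rank into an actual index.
        while index_link[1] is not None and index_link[1][0] <= index:
            index += 1
            index_link = index_link[1]
        index_link[1] = [index, index_link[1]]
        # Decode the index into one item per factor.
        items = []
        for factor in factors:
            items.append(factor[index % len(factor)])
            index //= len(factor)
        yield items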
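If walking the linked list becomes the bottleneck, one possible alternative for producing the indices is a sparse Fisher-Yates shuffle. This is a different technique from the linked list above, not the author's method, and `random_unique_indices` is a hypothetical name. It keeps a dict of swapped positions instead of a sorted list, so each draw takes a constant number of dict operations:

```python
import random

def random_unique_indices(amount):
    """Yield every integer in range(amount) exactly once, in random order."""
    # Sparse Fisher-Yates shuffle over a virtual array where arr[i] == i.
    # `moved` stores only positions whose value differs from the position,
    # so it never holds more entries than indices drawn so far.
    moved = {}
    for last in reversed(range(amount)):
        pick = random.randint(0, last)
        # Value currently sitting at position `pick`.
        value = moved.get(pick, pick)
        # Move the value at the last live position into the freed slot.
        moved[pick] = moved.get(last, last)
        # Position `last` is no longer live, so forget it.
        moved.pop(last, None)
        yield value

print(list(random_unique_indices(10)))
```

Each yielded index could then be decoded with the same `%` and `//` loop used in the function above.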