How to efficiently generate a set of unique random numbers with a predefined distribution?

Problem description

I have a map of items with some probability distribution:

Map<SingleObjectiveItem, Double> itemsDistribution;

Given a certain m, I have to generate a Set of m elements sampled from the above distribution.

So far I have been using the naive way of doing it:

while (mySet.size() < m)
    mySet.add(getNextSample(itemsDistribution));

The getNextSample(...) method fetches an object from the distribution according to its probability. Now, as m increases, performance suffers severely. For m = 500 and itemsDistribution.size() = 1000 elements, there is too much thrashing and the code stays in the while loop for too long. Generate 1000 such sets and you have an application that crawls.
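
For illustration, a sequential getNextSample of the kind discussed in the update below might look like the following. This is only a hypothetical sketch; the actual implementation is not shown in this post.

import java.util.Map;

// Hypothetical linear-scan sampler, shown only to illustrate the assumed
// O(n) cost per call; the real getNextSample is not part of the question.
static <T> T getNextSample(Map<T, Double> distribution) {
    double total = 0;
    for (double w : distribution.values()) {
        total += w;
    }
    double target = Math.random() * total; // uniform draw in [0, total)
    double running = 0;
    for (Map.Entry<T, Double> entry : distribution.entrySet()) {
        running += entry.getValue();
        if (target < running) {
            return entry.getKey();
        }
    }
    // Reachable only through floating-point rounding at the upper boundary.
    return distribution.keySet().iterator().next();
}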

Is there a more efficient way to generate a unique set of random numbers with a "predefined" distribution? Most collection shuffling techniques and the like are uniformly random. What would be a good way to address this?

UPDATE: The loop will call getNextSample(...) "at least" 1 + 2 + 3 + ... + m = m(m+1)/2 times. That is, in the first iteration we will definitely get a sample for the set; in the second it may be called at least twice, and so on. If getNextSample is sequential in nature, i.e., it goes through the entire cumulative distribution to find the sample, then the running time of the loop is at least n * m(m+1)/2, where n is the number of elements in the distribution. If m = cn with 0 < c <= 1, the loop is at least Ω(n^3). And that is only the lower bound!

If we replace the sequential search with binary search, the complexity would still be at least Ω(n^2 log n). More efficient, but perhaps not by a large margin.

Also, removing items from the distribution is not possible, since I call the above loop k times to generate k such sets. These sets are part of a randomized 'schedule' of items; hence a 'set' of items.

Recommended answer

The problem is unlikely to be the loop you show:

Let n be the size of the distribution, and I be the number of invocations of getNextSample. We have I = sum_i(C_i), where C_i is the number of invocations of getNextSample while the set has size i. To find E[C_i], observe that C_i is the inter-arrival time of a Poisson process with λ = 1 - i/n, and is therefore exponentially distributed with rate λ. Hence E[C_i] = 1/λ = 1/(1 - i/n) <= 1/(1 - m/n), and therefore E[I] < m/(1 - m/n).

That is, sampling a set of size m = n/2 will take, on average, fewer than 2m = n invocations of getNextSample. If that is "slow" and "crawls", it is likely because getNextSample itself is slow. This is actually unsurprising, given the unsuitable way the distribution is passed to the method (because the method will, of necessity, have to iterate over the entire distribution to find a random element).
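
As a quick sanity check of this bound (my own addition, not part of the original answer), a small simulation with uniform weights, where the success probability 1 - i/n holds exactly, can count the average number of draws. For n = 1000 and m = 500 the bound gives 1000 draws, and the simulation should print roughly n * ln 2 ≈ 693. The class name is illustrative.

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Monte-Carlo check of E[I] < m / (1 - m/n) for uniform rejection sampling.
public class ExpectedDrawsCheck {
    public static void main(String[] args) {
        int n = 1000, m = 500, runs = 10_000;
        Random random = new Random();
        long totalDraws = 0;
        for (int r = 0; r < runs; r++) {
            Set<Integer> set = new HashSet<>();
            while (set.size() < m) {
                set.add(random.nextInt(n)); // one simulated getNextSample call
                totalDraws++;
            }
        }
        // Expected output: about 693 (= n * ln 2), below the bound of 1000.
        System.out.println("average draws: " + (double) totalDraws / runs);
    }
}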

The following should be faster (if m < 0.8 n):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Distribution<T> {
    private final double[] cumulativeWeight;
    private final T[] item;
    private double totalWeight;

    @SuppressWarnings("unchecked")
    Distribution(Map<T, Double> probabilityMap) {
        int i = 0;

        cumulativeWeight = new double[probabilityMap.size()];
        item = (T[]) new Object[probabilityMap.size()];

        for (Map.Entry<T, Double> entry : probabilityMap.entrySet()) {
            item[i] = entry.getKey();
            totalWeight += entry.getValue();
            cumulativeWeight[i] = totalWeight;
            i++;
        }
    }

    T randomItem() {
        double weight = Math.random() * totalWeight;
        int index = Arrays.binarySearch(cumulativeWeight, weight);
        if (index < 0) {
            // binarySearch returns -(insertion point) - 1 when not found;
            // the insertion point is the index of the sampled item.
            index = -index - 1;
        }
        return item[index];
    }

    Set<T> randomSubset(int size) {
        Set<T> set = new HashSet<>();
        while (set.size() < size) {
            set.add(randomItem());
        }
        return set;
    }
}

import java.util.HashMap;
import java.util.Set;

public class Test {

    public static void main(String[] args) {
        int max = 1_000_000;
        HashMap<Integer, Double> probabilities = new HashMap<>();
        for (int i = 0; i < max; i++) {
            probabilities.put(i, (double) i);
        }

        Distribution<Integer> d = new Distribution<>(probabilities);
        Set<Integer> set = d.randomSubset(max / 2);
        //System.out.println(set);
    }
}

The expected runtime is O(m / (1 - m / n) * log n). On my computer, a subset of size 500_000 of a set of 1_000_000 is computed in about 3 seconds.

As we can see, the expected runtime approaches infinity as m approaches n. If that is a problem (i.e. m > 0.9 n), the following more complex approach should work better:

Set<T> randomSubset(int size) {
    Set<T> set = new HashSet<>();
    while (set.size() < size) {
        T randomItem = randomItem();
        remove(randomItem); // removes the item from the distribution
        set.add(randomItem);
    }
    return set;
}

Implementing remove efficiently requires a different representation of the distribution, for instance a binary tree where each node stores the total weight of the subtree rooted at it.
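
One standard structure with exactly this property is a Fenwick (binary indexed) tree over the weights: drawing an item descends the implicit tree in O(log n), and remove is a point update that zeroes the item's weight. The original answer gives no code for this, so the following sketch and its names (WeightedSampler, sample, update) are my own illustration:

import java.util.Random;

// Sketch: weighted sampling without replacement via a Fenwick tree.
// Both sample() and remove() run in O(log n).
class WeightedSampler {
    private final double[] tree;   // Fenwick tree over the weights, 1-based
    private final double[] weight; // current weight of each item
    private double totalWeight;
    private final Random random = new Random();

    WeightedSampler(double[] weights) {
        tree = new double[weights.length + 1];
        weight = weights.clone();
        for (int i = 0; i < weights.length; i++) {
            update(i, weights[i]);
            totalWeight += weights[i];
        }
    }

    private void update(int index, double delta) {
        for (int i = index + 1; i < tree.length; i += i & -i) {
            tree[i] += delta;
        }
    }

    /** Draws an index with probability proportional to its remaining weight. */
    int sample() {
        double target = random.nextDouble() * totalWeight;
        int index = 0;
        // Descend the implicit tree: find the first index whose prefix sum
        // of weights exceeds the random target.
        for (int mask = Integer.highestOneBit(tree.length - 1); mask != 0; mask >>= 1) {
            int next = index + mask;
            if (next < tree.length && tree[next] <= target) {
                target -= tree[next];
                index = next;
            }
        }
        return index; // 0-based index of the sampled item
    }

    /** Removes an item so it can never be sampled again. */
    void remove(int index) {
        update(index, -weight[index]);
        totalWeight -= weight[index];
        weight[index] = 0;
    }
}

With this, the m ≈ n case becomes: sample, add to the set, remove, repeat. Every iteration then succeeds, so building one set costs O(m log n) regardless of how close m is to n.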

But that is rather complicated, so I wouldn't go that route if m is known to be significantly smaller than n.
