编写一个程序,发现100家大号码的开出1十亿数​​字数组 [英] Write a program to find 100 largest numbers out of an array of 1 billion numbers

查看:175
本文介绍了编写一个程序,发现100家大号码的开出1十亿数​​字数组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近参加了一个访谈,我是问:写一个程序,找出100家大号码的开出1十亿的数字数组的。

I recently attended an interview where I was asked "write a program to find 100 largest numbers out of an array of 1 billion numbers."

我只能给一个蛮力解决方案,这是在O(nlogn)的时间复杂度的数组进行排序,并采取最后的100个号码。

I was only able to give a brute force solution which was to sort the array in O(nlogn) time complexity and take the last 100 numbers.

Arrays.sort(array);

面试官在寻找一个更好的时间复杂度,我试了几个其他的解决方案,但没有回答他。有没有更好的时间复杂度的解决方案?

The interviewer was looking for a better time complexity, I tried a couple of other solutions but failed to answer him. Is there a better time complexity solution?

推荐答案

您可以保留100最大号的优先级队列,遍历十亿的数字,只要您遇到队列中的数字大于最小号(头队列),取出队列的头部,并添加新的号码到队列中。

You can keep a priority queue of the 100 biggest numbers, iterate through the billion numbers, whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.

编辑: 作为开发指出,与堆实现优先级队列,插入的复杂性队列 O(logN)的

as Dev noted, with a priority queue implemented with a heap, the complexity of insertion to queue is O(logN)

在你最坏的情况下十亿*登录 2 (100)十亿*日志更好<子> 2 (十亿)

In the worst case you get billion*log2(100) which is better than billion*log2(billion)

在一般情况下,如果你从一组N个数字所需要的最大的K个,复杂度 O(NlogK),而不是 O( NlogN),这样做是很显著当K是非常小的比较为N。

In general, if you need the largest K numbers from a set of N numbers, the complexity is O(NlogK) rather than O(NlogN), this can be very significant when K is very small comparing to N.

EDIT2:

该算法的预期时间是pretty的有趣,因为在每一次迭代的插入可能会或可能不会发生。第i号被插入到队列的概率是一个随机变量的概率比至少 IK 从同一分布随机变量(第一k大号码会自动添加到队列)。我们可以用订单统计数据(见链接)来计算这个概率。例如,让我们假设的数字随机抽取选择均匀{0,1} ,(IK)日数(从我的数字)的预期值为(IK)/ I ,以及一个随机变量是大于该值的几率 1 - [(IK)/ I] = K / I

The expected time of this algorithm is pretty interesting, since in each iteration an insertion may or may not occur. The probability of the i'th number to be inserted to the queue is the probability of a random variable being larger than at least i-K random variables from the same distribution (the first k numbers are automatically added to the queue). We can use order statistics (see link) to calculate this probability. For example, lets assume the numbers were randomly selected uniformly from {0, 1}, the expected value of (i-K)th number (out of i numbers) is (i-k)/i, and chance of a random variable being larger than this value is 1-[(i-k)/i] = k/i.

因此​​,插入的预期数是:

Thus, the expected number of insertions is:

和预期的运行时间可以pssed作为前$ P $:

And the expected running time can be expressed as:

K 的时间来产生的第一个队列氏/ code>元素,那么 NK 比较,并插入如上所述的期望的数量,每一个平均需要日志(K)/ 2 时间)

(k time to generate the queue with the first k elements, then n-k comparisons, and the expected number of insertions as described above, each takes an average log(k)/2 time)

请注意,当 N 是非常大的比较 K ,这前pression是拉近了许多以 N ,而不是 NlogK 。这有点直观,如在问题的情况下,即使经过10000次迭代(这是非常小的相对于一个十亿),将被插入的一个数的机会,该队列是非常小的。

Note that when N is very large comparing to K, this expression is a lot closer to n rather than NlogK. This is somewhat intuitive, as in the case of the question, even after 10000 iterations (which is very small comparing to a billion), the chance of a number to be inserted to the queue is very small.

这篇关于编写一个程序,发现100家大号码的开出1十亿数​​字数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆