Write a program to find the 100 largest numbers out of an array of 1 billion numbers

Problem description

I recently attended an interview where I was asked to "write a program to find the 100 largest numbers out of an array of 1 billion numbers."

I was only able to give a brute-force solution, which was to sort the array in O(n log n) time and take the last 100 numbers.

Arrays.sort(array);

The interviewer was looking for a better time complexity. I tried a couple of other solutions but failed to answer him. Is there a solution with better time complexity?

Recommended answer

You can keep a priority queue of the 100 biggest numbers seen so far and iterate through the billion numbers. Whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head and add the new number to the queue.
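The approach above can be sketched in Java with java.util.PriorityQueue, which is a min-heap by default, so its head is the smallest value currently kept. This is a minimal sketch, not a definitive implementation: the class name TopK, the method name largest, and the choice of an int[] input are assumptions made for illustration.

```java
import java.util.PriorityQueue;

class TopK {
    // Returns the k largest values of data, in ascending order.
    // (TopK and largest are hypothetical names used only for this sketch.)
    static int[] largest(int[] data, int k) {
        if (k <= 0) {
            return new int[0];
        }
        // Min-heap: peek() is the smallest of the values kept so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(k);
        for (int x : data) {
            if (heap.size() < k) {
                // The first k values are added unconditionally.
                heap.add(x);
            } else if (x > heap.peek()) {
                // x beats the smallest kept value: replace the head.
                heap.poll();
                heap.add(x);
            }
        }
        // Polling a min-heap yields its contents in ascending order.
        int[] result = new int[heap.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = heap.poll();
        }
        return result;
    }
}
```

Calling TopK.largest(array, 100) would cover the case in the question. The heap never holds more than k elements, so the extra memory stays small no matter how long the input is.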

As Dev noted, with a priority queue implemented as a heap, the complexity of an insertion is O(log K), where K is the number of elements in the queue (100 here).

In the worst case you get billion*log2(100), which is better than billion*log2(billion): log2(100) is about 6.6 while log2(billion) is about 29.9, so the logarithmic factor shrinks by a factor of 4.5.

In general, if you need the largest K numbers from a set of N numbers, the complexity is O(N log K) rather than O(N log N). This can be very significant when K is very small compared to N.

The expected time of this algorithm is pretty interesting, since in each iteration an insertion may or may not occur. The probability that the i-th number is inserted into the queue is the probability that a random variable is larger than at least i-K random variables from the same distribution (the first K numbers are automatically added to the queue). We can use order statistics (see link) to calculate this probability. For example, let's assume the numbers were selected uniformly at random from the interval [0, 1]. The expected value of the (i-K)-th smallest number (out of i numbers) is then roughly (i-K)/i, and the chance of a random variable being larger than this value is 1 - [(i-K)/i] = K/i.

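Thus, the expected number of insertions is:

$$\sum_{i=K+1}^{N} \frac{K}{i} = K\,(H_N - H_K) \approx K \ln\frac{N}{K}$$

where H_m is the m-th harmonic number.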

And the expected running time can be expressed as:

$$K + (N - K) + \frac{\log K}{2} \sum_{i=K+1}^{N} \frac{K}{i}$$

(K time to build the queue from the first K elements, then N-K comparisons, plus the expected number of insertions as described above, each of which takes log(K)/2 time on average.)

Note that when N is very large compared to K, this expression is a lot closer to N than to N log K. This is somewhat intuitive: in the case of the question, even after 10,000 iterations (which is very few compared to a billion), the chance that a number is inserted into the queue is already very small, about K/i = 100/10,000 = 1%.
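To get a feel for how small the insertion count is, the per-step probability K/i from the analysis above can be summed directly. The following is a rough sketch under the answer's own assumption (values drawn independently from one continuous distribution); the class name ExpectedInsertions and the methods exact and approx are hypothetical names introduced for illustration.

```java
class ExpectedInsertions {
    // Sums K/i for i = K+1 .. N: the expected number of replacements
    // after the queue has been filled with the first K values.
    static double exact(long n, long k) {
        double sum = 0.0;
        for (long i = k + 1; i <= n; i++) {
            sum += (double) k / i;
        }
        return sum;
    }

    // Logarithmic approximation of the same sum: K * ln(N / K).
    static double approx(long n, long k) {
        return k * Math.log((double) n / k);
    }
}
```

For N = 1,000,000,000 and K = 100, the approximation gives 100 * ln(10^7), which is about 1,612 expected insertions out of a billion iterations, and the exact sum lands very close to that. This is why the comparison against the head of the queue dominates the expected running time.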
