Interview puzzle: Sorting a million number input with limited memory


Problem Description

I tried answering this using external sort, but the interviewer replied that the complexity was too high: n * n * log(n), i.e. n^2 * log n. Is there a better alternative?

To simplify the question: let's suppose we have 1000 elements to sort, with space allocated for only 100 elements. What is the best algorithm that will take less time than the external sort?

Recommended Answer

I don't know which external sort you (or the interviewer) meant, but my suggestion is a 10-way (in your case) merge (a rough code sketch follows the list below):

  • Split the file into chunks of MAX_MEM size (100 elements)
    • this is O(1)
  • Sort each chunk in memory and write it back to disk as a sorted run
    • this is O((n/max_mem) * max_mem * log(max_mem)) = O(n log(max_mem))
  • Merge all runs, repeatedly emitting the smallest of the run heads
    • this is O(n log(n/max_mem)) using a minHeap, or O(n^2/max_mem) trivially (which may be faster in practice)
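To make the three steps concrete, here is a minimal Python sketch of the scheme (not part of the original answer). It assumes the input is a text file with one integer per line; the file paths, the MAX_MEM constant and the external_sort name are made up for illustration, and heapq.merge plays the role of the minHeap in the final pass.

import heapq
import itertools
import os
import tempfile

MAX_MEM = 100  # illustrative in-memory capacity, in number of elements


def external_sort(input_path, output_path, max_mem=MAX_MEM):
    # Steps 1+2: split the input into chunks of at most max_mem elements,
    # sort each chunk in memory, and write it to its own temporary file.
    chunk_paths = []
    with open(input_path) as src:
        while True:
            chunk = [int(line) for line in itertools.islice(src, max_mem)]
            if not chunk:
                break
            chunk.sort()  # O(max_mem log(max_mem)) per chunk
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as out:
                out.writelines(f"{x}\n" for x in chunk)
            chunk_paths.append(path)

    # Step 3: open every sorted run as a stream and merge them by always
    # emitting the smallest head element; heapq.merge keeps a min-heap over
    # the streams, so this pass costs O(n log(number_of_runs)).
    streams = [open(p) for p in chunk_paths]
    try:
        with open(output_path, "w") as out:
            merged = heapq.merge(*[map(int, s) for s in streams])
            out.writelines(f"{x}\n" for x in merged)
    finally:
        for s in streams:
            s.close()
        for p in chunk_paths:
            os.remove(p)

For the numbers in the question this creates 10 temporary runs of 100 sorted elements each and merges them in a single pass, so only about max_mem elements (plus small read buffers for the 10 streams) are held in memory at any time.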

Concerning computation, this is O(n * (log(max_mem) + log(n/max_mem))) = O(n log(n)).
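As a quick sanity check with the numbers from the question (n = 1000, max_mem = 100): the chunk sorts cost roughly 1000 * log2(100) ≈ 6,600 comparisons and the 10-way merge roughly 1000 * log2(10) ≈ 3,300, for a total of about 1000 * log2(1000) ≈ 10,000, matching the O(n log n) bound.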

Concerning disk I/O, if all merging is done in one pass, this is 2*n reads and 2*n writes only. More generally, it's (1 + [depth of the merge tree]) * n.
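With the 1000/100 example a single 10-way pass is enough, so every element is read twice (once to build the sorted runs, once to merge them) and written twice, i.e. roughly 2,000 element reads and 2,000 element writes in total.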

All writes are sequential. The first read is sequential; the second one is sequential too, interleaved from 10 files.

If there were much more data, you'd need a repeated or recursive merge (100 elements per chunk, then pick N chunks repeatedly). At this point it's worth replacing the split+sort step with replacement/selection as described in @amit's answer, especially when the data is already almost sorted (you may be able to skip the merging step completely).
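As a rough sketch of the replacement/selection idea mentioned above (run generation with a min-heap of at most max_mem elements), again in Python with illustrative names: on random input the runs come out roughly twice as long as max_mem, and on almost-sorted input a single run can cover the whole file, which is why the merge step may disappear.

import heapq
import itertools


def replacement_selection_runs(values, max_mem=100):
    # Yield sorted runs from `values`, built with a min-heap of at most
    # max_mem elements. Elements smaller than the last one written are
    # deferred ("frozen") until the next run.
    it = iter(values)
    heap = list(itertools.islice(it, max_mem))
    heapq.heapify(heap)
    frozen, run = [], []
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)
        try:
            nxt = next(it)
        except StopIteration:
            pass  # input exhausted: just drain the heap
        else:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)   # still fits into the current run
            else:
                frozen.append(nxt)          # too small, goes to the next run
        if not heap:                        # current run is complete
            yield run
            run, heap, frozen = [], frozen, []
            heapq.heapify(heap)

For example, replacement_selection_runs([5, 1, 9, 2, 8, 3], max_mem=3) yields [1, 2, 5, 8, 9] and then [3]: even with only three slots of memory the first run holds five elements, and only the deferred 3 starts a second run.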

Note that a higher N may increase computation (very slightly, if you use the right structures), but it reduces the amount of disk I/O significantly (up to a point: if you merge too many chunks at once, you may run out of memory for the read buffers, causing unnecessary reads). Disk I/O is expensive, CPU cycles are not.

