Sorting data larger than the RAM size


Problem description

This is a Google interview question: Given 2 machines, each having 64 GB of RAM and together containing all of the integers (8 bytes each), sort the entire 128 GB of data. You may assume a small amount of additional RAM. Extend this to sort data stored across 1000 machines.

I came up with external sort: divide the entire data into chunks, sort each chunk in memory, write the sorted chunks back out, then read them back piece by piece and merge them. Is there a better way? What would be the complexity?
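As a rough sketch of this external-sort idea (the chunk size and the use of temporary files here are illustrative assumptions, not part of the question): split the data into RAM-sized chunks, sort each chunk, write the sorted runs out, and then do a k-way merge of the runs.

import heapq
import struct
import tempfile

def external_sort(values, chunk_size=1_000_000):
    # chunk_size is an illustrative stand-in for "what fits in RAM"
    run_files = []

    def flush(chunk):
        chunk.sort()                          # in-memory sort of one chunk
        f = tempfile.TemporaryFile()
        for v in chunk:
            f.write(struct.pack('q', v))      # 8-byte signed integer
        f.seek(0)
        run_files.append(f)

    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            flush(chunk)
            chunk = []
    if chunk:
        flush(chunk)

    def read_run(f):
        while True:
            raw = f.read(8)
            if not raw:
                return
            yield struct.unpack('q', raw)[0]

    # k-way merge of the sorted runs, yielding values in order
    return heapq.merge(*(read_run(f) for f in run_files))

At the scale of the interview question, a "chunk" would be whatever fits in one machine's 64 GB, and the sorted runs would sit on disk or on the other machine rather than in temporary files; the same k-way merge idea would also cover sorted runs coming from 1000 machines.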

Recommended answer

ChingPing proposes an O(n log n) sort for each subset, followed by a linear merge (by swapping elements). The problem with quicksort (and most of the O(n log n) sorts) is that they require O(n) extra memory. I'd recommend instead using SmoothSort, which uses constant memory and still runs in O(n log n).
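Smoothsort itself is fairly involved, so as a rough stand-in here is a plain heapsort sketch: like smoothsort it sorts in place in O(n log n) with only constant extra memory (this illustrates the constant-memory point, and is not the SmoothSort the answer recommends).

def heapsort(a):
    def sift_down(root, end):
        # push a[root] down until the max-heap property holds up to index 'end'
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            if child + 1 <= end and a[child] < a[child + 1]:
                child += 1
            if a[root] < a[child]:
                a[root], a[child] = a[child], a[root]
                root = child
            else:
                return

    n = len(a)
    for start in range(n // 2 - 1, -1, -1):   # build the max-heap in place
        sift_down(start, n - 1)
    for end in range(n - 1, 0, -1):           # move the current maximum to the end
        a[0], a[end] = a[end], a[0]
        sift_down(0, end - 1)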

The worst case scenario is where you have something like:

setA = [maxInt .. 1]
setB = [0..minInt]

where both sets are ordered in reverse, but the merge then has to run in the reverse order.

The (IMO clearer) explanation of ChingPing's solution is:

Have pointers 'pointerA' and 'pointerB' initialized at the beginning of each array
While setA's pointer is not at the end
  if (setA[pointerA] < setB[pointerB])
    then { pointerA++; }
    else { swap(setA[pointerA], setB[pointerB]); pointerB++; }

Now both should be sorted.
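A minimal sketch that just transcribes the pseudocode above, purely to illustrate the pointer movement (the bounds check on pointerB is an added assumption, since the pseudocode only stops when pointerA reaches the end of setA):

def swap_merge(setA, setB):
    pointerA, pointerB = 0, 0
    while pointerA < len(setA) and pointerB < len(setB):
        if setA[pointerA] < setB[pointerB]:
            pointerA += 1                     # setA's element is already in place
        else:
            # swap the smaller setB element into setA's current slot
            setA[pointerA], setB[pointerB] = setB[pointerB], setA[pointerA]
            pointerB += 1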
