从硬盘排序大量整数 [英] Sorting huge Number of Integers from hard disk

查看:59
本文介绍了从硬盘排序大量整数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给出硬盘上的100 GB整数数据(RAM总计为2 GB),如何以最少的磁盘操作对整数进行排序。在这里,从磁盘中获取一个数字被视为一个磁盘操作(尽管实际上可以获取一个数据块)。

Given 100 GB integer Data on Hard Disk with RAM amounting to 2 GB, how to sort the integers with minimal disk operation. Here fetching one number from disk is considered as one disk operation( though in reality a block of data can be fetched).

我们可以在磁盘上使用额外的空间来临时存储

We can use additional space on the disk for temporary storage and no need to consider the operations of cleaning up temporary spaces used.

推荐答案

正如其他人所指出的,您可以使用O(n)计数排序。但是,您还需要考虑其他一些问题。我们假设您存储的是32位整数,所以100GB〜27e9整数。

As other people have noted, you can use an O(n) counting sort. However there are some additional problems you need to consider. We'll assume you're storing 32-bit integers, so 100GB ~ 27e9 ints.

如果所有整数都相同,那么它将出现〜27e9次,大于32位int。因此,您的计数器必须是64位整数。

If all the integers are the same, then it will occur ~27e9 times, which is larger than a 32 bit int. Therefore your counters will have to be 64-bit integers.

有了2GB的RAM,您一次只能在RAM中存储约125e6个计数器。如果我们不能对整数的分布做任何假设,我们要么必须:

With 2GB of RAM, you can only store ~125e6 counters in RAM at any one time. If we can't make any assumptions about the distribution of integers, we would either have to:


  • 分别增加HD上的计数器,或

  • 忽略我们遇到的所有不在当前存储在RAM中的计数器数组中的整数。

我认为后一种选择更好。由于我们需要约4e9个64位计数器,并且只能存储2GB,因此,我们需要在整个阵列中运行约16次。如果我们考虑遇到整数序列(例如0,1≤31,0),则第一种选择显然不好。这些计数器不会同时存储在RAM中,因此至少需要进行2次HD写操作。

I think the latter option is better. Since we need ~4e9 64-bit counters and can only store 2GB, we would need to run through the entire array ~16 times. The first option is clearly no good if we consider encountering a sequence of integers such as 0,1<<31,0. These counters would not be stored in RAM at the same time, and thus at least 2 HD writes are required.

因此,我认为对于您的特定大小问题(100GB), N向合并会更好,因为这只需阅读整个数组log_2(100)〜8次。

Because of this, I think for the specific size of your problem (100GB), an N-way merge sort would be better, as this would only require reading the entire array log_2(100) ~ 8 times.

但是,如果访问者立即将问题更改为 10TB数组,仍为2GB RAM,则进行计数排序会轻松获胜。

However, if the interviewer immediately changed the question to "10TB array, still 2GB of RAM", then the counting sort would easily win.

这篇关于从硬盘排序大量整数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆