设计外部存储器排序算法 [英] Designing an external memory sorting algorithm

查看:103
本文介绍了设计外部存储器排序算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果我有一个很大的列表存储在需要排序的外部存储器中. Asumimg该列表对于内部存储器而言太大了,在设计外部排序算法时应考虑哪些主要因素?

If I have a very large list stored in external memory that needs to be sorted. Asumimg this list is too large for internal memory, what major factors should be considered in designing an external sorting algorithm?

推荐答案

在构建自己的外部排序之前,您可能会先看一下操作系统提供的工具. Windows具有SORT.EXE,尽管它具有...特质,但在某些文本文件上也能很好地工作. GNU排序也很好.您可以尝试使用其中任何一个数据来查看它们是否可以满足您的需求.

Before you go building your own external sort, you might look at the tools your operating system supplies. Windows has SORT.EXE, which works well enough on some text files, although it has ... idiosyncrasies. The GNU sort, too, works pretty well. You could give either of those a try on a subset of your data to see if they'll do what you need.

否则. .

外部排序是一种非常知名的算法.总体思路:

The external sort is a pretty well-known algorithm. The general idea:

  1. 将尽可能多的数据加载到内存中.
  2. 对该块进行排序.
  3. 将该块写入外部存储器.
  4. 重复步骤1-3,直到所有块都已排序并存储.
  5. 合并排序的块.

假设您有n个项目,每个项目分别分成km个元素的块(so n = k*m),则第一部分(步骤1-4)花费的时间与k *(m log m ).

Assuming you have n items that are separated into k blocks of m elements each (so n = k*m), the first part (steps 1-4) takes time proportional to k*(m log m).

完成步骤1-4后,您已对km个项目的块进行了排序(或者可能是k-1m个项目的块,还有一个项目的数量略少的块).或者,如果您要对字符串进行排序,则您有k个块,它们的大小大致相同,但是每个块中的字符串数会有所不同.

After completing steps 1-4, you have k sorted blocks of m items (or possibly k-1 blocks of m items, and one block that has somewhat fewer items). Or, if you're sorting strings, you have k blocks that are approximately the same size, but the number of strings in each block will vary.

您现在需要合并这些排序的块.典型的方法是使用 k路合并.

You now need to merge these sorted blocks. The typical way to do that is with a k-way merge.

您创建一个最小堆,其中包含每个块中的第一项.然后,从堆中选择根项,该根项是所有块中最小的项.您将其输出为第一项.然后,从最小的块中读取下一个项目,并将其放在堆上.那就是:

You create a min-heap that contains the first item from each block. You then select the root item from the heap, which is the smallest item of all the blocks. You output that as the first item. Then, read the next item from the block that the smallest came from, and place it on the heap. That is:

create heap
for each block
    read item and add to heap
end for

while heap is not empty
    remove smallest item from heap
    write to output
    read next item from block that contained smallest item
    add to heap
end while

算法的这一部分是O(n log k),其中n是项目总数,k是块数.

This part of the algorithm is O(n log k), where n is the total number of items and k is the number of blocks.

正如其他人指出的那样,有效的外部排序的一个关键是减少I/O.外部存储速度.我上面描述的算法执行的I/O越少越好.每一项从外部存储器读取两次,每项两次写入外部存储器.乍看之下,其他看起来更简单或更快速的算法最终在处理真实数据时会变得慢得多,因为它们花费太多时间进行I/O.

As somebody else pointed out, one key to efficient external sorting is to reduce I/O. External storage is slow. The algorithm I described above does as little I/O as possible. Each item is read from external storage twice, and each item is written to external storage twice. Other algorithms that look simpler or faster at first blush end up being much slower when working with real data because they spend too much time doing I/O.

如果您对实现感兴趣,我早些时候写了一系列有关

If you're interested in an implementation, I wrote a series of articles some time back about sorting a very large text file. The code is C#, but the description should allow you to translate to any language with little trouble.

这篇关于设计外部存储器排序算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆