Efficient Out-Of-Core Sorting

Problem Description

I'm trying to work out how to efficiently sort a huge dataset that won't fit in memory. The obvious answer at a high level is to sort a whole bunch of chunks that do fit in memory using some standard algorithm, write these out to disk, and then merge them. Merging them is the problem.
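
For concreteness, here is a minimal sketch of that first phase in Python, assuming line-oriented records that sort lexicographically; the function name, the record-count limit, and the use of temporary files are illustrative choices, not something specified in the question.

    import itertools
    import os
    import tempfile

    def create_sorted_runs(input_path, max_records_in_memory):
        """Phase 1: split the input into chunks that fit in memory, sort each
        chunk with an ordinary in-memory sort, and write each sorted chunk
        (a "run") to its own temporary file."""
        run_paths = []
        with open(input_path) as src:
            while True:
                chunk = list(itertools.islice(src, max_records_in_memory))
                if not chunk:
                    break
                chunk.sort()                       # any standard in-memory sort
                fd, path = tempfile.mkstemp(suffix=".run")
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)
        return run_paths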

Let's say the data divides up into C chunks, so I have C files to merge. If I do a C-way merge in one pass, then technically I have an O(N^2) algorithm, though one that only has to perform O(N) writes to disk. If I iteratively merge them into C/2 files, then C/4 files, etc. then I have an O(N log N) algorithm, but one that has to perform O(N log N) writes to disk, and therefore has a huge constant term.
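
(In practice, the single C-way merge is usually driven by a priority queue holding the current head record of each run: each of the N output records then costs O(log C) comparisons rather than O(C), so the merge is O(N log C) overall while still writing the data to disk only once. Below is a rough sketch of that idea, again assuming line-oriented records and illustrative names; Python's standard heapq.merge implements the same technique.)

    import heapq

    def merge_runs(run_paths, output_path):
        """Single-pass C-way merge: keep one head record per run in a heap,
        so producing each output record costs O(log C) comparisons and the
        whole dataset is written to disk only once."""
        files = [open(p) for p in run_paths]
        heap = []
        for i, f in enumerate(files):
            line = f.readline()
            if line:
                heapq.heappush(heap, (line, i))   # (record, index of its run)
        with open(output_path, "w") as out:
            while heap:
                line, i = heapq.heappop(heap)
                out.write(line)
                nxt = files[i].readline()         # refill from the same run
                if nxt:
                    heapq.heappush(heap, (nxt, i))
        for f in files:
            f.close()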

What is the typical solution to this conundrum? Is there any good one?

Recommended Answer

The simple answer is that there is no simple answer to this question. There are lots of answers, most of them fairly complex -- Knuth volume 3 (for one example) devotes a great deal of space to it.

One thing that becomes obvious when looking through what's been done is that you really want to minimize the number of files you create during your initial sorting, and maximize the length of each. To do that, you generally want to read in about as much data as you can fit in memory, but instead of just sorting it and writing it out, you want to put it into a heap. Then as you write each record out, you read in another record. If the new record sorts after the one you just wrote, it can still be part of the current run, so you insert it into your heap; if it sorts before the one you just wrote, you insert it into a second heap instead and continue with the next smallest record in the first heap. You stop putting records into the current file when the first heap is completely empty and your second heap is taking up all your memory. At that point, you start putting records into a new file, and basically "swap" the uses of the two heaps.
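
The scheme described here is the classic replacement selection technique. A rough Python sketch of it follows, assuming line-oriented records; the function name, the memory-limit parameter, and the temp-file handling are illustrative, not taken from the answer.

    import heapq
    import os
    import tempfile

    def replacement_selection(input_path, memory_records):
        """Run creation with the two-heap scheme described above: a record that
        still sorts after the last one written can extend the current run, so it
        goes into the current heap; otherwise it waits in the heap for the next run."""
        run_paths = []
        with open(input_path) as src:
            current = []                          # heap feeding the run being written
            for _ in range(memory_records):
                line = src.readline()
                if not line:
                    break
                current.append(line)
            heapq.heapify(current)
            pending = []                          # heap holding records for the next run
            while current or pending:
                fd, path = tempfile.mkstemp(suffix=".run")
                run_paths.append(path)
                with os.fdopen(fd, "w") as run:
                    while current:
                        record = heapq.heappop(current)
                        run.write(record)
                        new = src.readline()
                        if not new:
                            continue              # input exhausted; just drain the heap
                        if new >= record:         # can still join the current run
                            heapq.heappush(current, new)
                        else:                     # would break the run's order: defer it
                            heapq.heappush(pending, new)
                current, pending = pending, []    # "swap" the roles of the two heaps
        return run_paths

On random input, replacement selection is known to produce runs roughly twice the size of available memory on average, which is why the later merge has so much less to do.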

This will produce considerably longer intermediate files in the initial phase, so merging them is substantially less work.

I certainly didn't invent this -- I probably first read about it in Knuth, but perhaps in Algorithms + Data Structures = Programs (Niklaus Wirth) -- both discuss it. Knuth credits first publication of the method to "H. Seward", in his master's thesis at MIT in 1954. If you have the second edition of Knuth, it's on page 254 of volume 3. I've never gotten around to getting a copy of the third edition, so I don't have a page number for that.
