File-based merge sort on large datasets in Java

Problem description

Given large datasets that don't fit in memory, is there any library or API to perform a sort in Java? The implementation would possibly be similar to the Linux sort utility.

Recommended answer

Java provides a general-purpose sorting routine which can be used as part of a larger solution to your problem. A common approach to sorting data that is too large to fit in memory is this:

1) Read as much data as will fit into main memory; let's say it's 1 GB.

2) Quicksort that 1 GB (this is where you'd use Java's built-in sort from the Collections framework).

3) Write that sorted 1 GB to disk as "chunk-1".

4) Repeat steps 1-3 until you've gone through all the data, saving each chunk in a separate file. If your original data was 9 GB, you will now have 9 sorted chunks labeled "chunk-1" through "chunk-9".
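
To make this chunk-creation phase (steps 1-4) concrete, here is a minimal sketch. It assumes line-based text input sorted lexicographically, and it bounds each chunk by a line count rather than by bytes for simplicity; the names (ChunkSplitter, splitIntoSortedChunks, maxLinesPerChunk) are illustrative, not part of any standard API.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChunkSplitter {

    /**
     * Reads the input in batches of at most maxLinesPerChunk lines, sorts each
     * batch in memory, and writes it to a temp file ("chunk-1", "chunk-2", ...).
     * Returns the chunk files in creation order. (Illustrative sketch only.)
     */
    static List<File> splitIntoSortedChunks(File input, int maxLinesPerChunk) throws IOException {
        List<File> chunks = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(input), StandardCharsets.UTF_8))) {
            List<String> buffer = new ArrayList<>(maxLinesPerChunk);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxLinesPerChunk) {
                    chunks.add(writeSortedChunk(buffer, chunks.size() + 1));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {                          // last, partially filled chunk
                chunks.add(writeSortedChunk(buffer, chunks.size() + 1));
            }
        }
        return chunks;
    }

    /** Sorts one in-memory batch (step 2) and writes it out as a chunk file (step 3). */
    private static File writeSortedChunk(List<String> lines, int index) throws IOException {
        Collections.sort(lines);                              // Java's built-in sort
        File chunk = File.createTempFile("chunk-" + index + "-", ".txt");
        chunk.deleteOnExit();
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(chunk), StandardCharsets.UTF_8))) {
            for (String l : lines) {
                writer.write(l);
                writer.newLine();
            }
        }
        return chunk;
    }
}
```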

5) You now just need a final merge to combine the 9 sorted chunks into a single fully sorted data set. This merge works very efficiently against the pre-sorted chunks: it essentially opens 9 file readers (one per chunk) plus one file writer (for the output). It compares the first data element from each reader and selects the smallest value, which is written to the output file. The reader that supplied the selected value then advances to its next data element, and the 9-way comparison for the smallest value is repeated, again writing the answer to the output file. This continues until all data has been read from all the chunk files.

6) Once step 5 has read all the data, you are done: the output file now contains a fully sorted data set.
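
Below is a sketch of that final merge (steps 5 and 6). Instead of a hand-rolled 9-way comparison it uses a PriorityQueue keyed on each reader's current line, which applies the same idea to any number of chunks; ChunkMerger and ChunkCursor are illustrative names, and the method consumes the chunk files produced by the sketch above.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ChunkMerger {

    /** Pairs one chunk's reader with the line it is currently offering. */
    private static final class ChunkCursor {
        final BufferedReader reader;
        String current;
        ChunkCursor(BufferedReader reader, String current) {
            this.reader = reader;
            this.current = current;
        }
    }

    /**
     * Opens one reader per sorted chunk plus one writer for the output, then
     * repeatedly emits the smallest current line until every chunk is exhausted.
     * (Illustrative sketch only.)
     */
    static void mergeSortedChunks(List<File> chunks, File output) throws IOException {
        PriorityQueue<ChunkCursor> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.current));
        try {
            // Prime the heap with the first line of each chunk.
            for (File chunk : chunks) {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream(chunk), StandardCharsets.UTF_8));
                String first = reader.readLine();
                if (first != null) {
                    heap.add(new ChunkCursor(reader, first));
                } else {
                    reader.close();                           // empty chunk
                }
            }
            try (BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(output), StandardCharsets.UTF_8))) {
                while (!heap.isEmpty()) {
                    ChunkCursor smallest = heap.poll();       // chunk offering the smallest line
                    writer.write(smallest.current);
                    writer.newLine();
                    String next = smallest.reader.readLine();
                    if (next != null) {                       // advance that reader and re-queue it
                        smallest.current = next;
                        heap.add(smallest);
                    } else {
                        smallest.reader.close();              // this chunk is exhausted
                    }
                }
            }
        } finally {
            for (ChunkCursor cursor : heap) {
                cursor.reader.close();                        // close anything left open on error
            }
        }
    }
}
```

With k chunks the heap makes each selection O(log k) rather than O(k); for 9 chunks either approach is fine, but the heap version needs no changes as the chunk count grows.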

With this approach you could easily write your own generic "megasort" utility that takes a filename and a maxMemory parameter and sorts the file efficiently using temp files. I'd bet you could find at least a few existing implementations of this, but if not you can just roll your own as described above.
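
As a usage example, a hypothetical driver could glue the two sketches above into exactly that kind of utility, with maxLinesPerChunk standing in for the maxMemory parameter:

```java
import java.io.File;
import java.io.IOException;

public class MegaSort {
    // Hypothetical command-line entry point: java MegaSort <input> <output> <maxLinesPerChunk>
    public static void main(String[] args) throws IOException {
        File input = new File(args[0]);                       // file to sort
        File output = new File(args[1]);                      // fully sorted result
        int maxLinesPerChunk = Integer.parseInt(args[2]);     // crude stand-in for maxMemory
        ChunkMerger.mergeSortedChunks(
                ChunkSplitter.splitIntoSortedChunks(input, maxLinesPerChunk), output);
    }
}
```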
