Sorting 20GB of data


Question

In the past I had to work with big files, somewhere in the 0.1-3 GB range. Not all the 'columns' were needed, so it was OK to fit the remaining data in RAM. Now I have to work with files in the 1-20 GB range, and they will probably grow over time. That is a completely different situation, because you can no longer fit the data in RAM.

My file contains several million 'entries' (I have found one with 30 million entries). An entry consists of about 10 'columns': one string (50-1000 Unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user, only the top entries (1-30%) are relevant; the rest is low-quality data.

So, I need some suggestions about which direction to head in. I definitely don't want to put the data in a DB, because databases are hard to install and configure for non-computer-savvy persons. I would like to deliver a monolithic program.

Showing the data is not difficult at all. But sorting... without loading the data into RAM, on regular PCs (2-6 GB of RAM)... will kill some good hours.

I was looking a bit into MMFs (memory-mapped files), but this article from Danny Thorpe shows that they may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/

So, I was thinking about loading into RAM only the data from the column that has to be sorted, together with a pointer to the address (the offset into the disk file) of each 'entry'. I sort the 'column', then I use the pointers to find the entry corresponding to each column cell, and restore the entries in order. The 'restoration' is written directly to disk, so no additional RAM is required.
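The key-plus-offset idea described above can be sketched roughly as follows. Python is used here only for brevity (the question targets Delphi/FPC; the technique itself is language-agnostic), and the file format is a hypothetical one: line-based records with the sort key as the first tab-separated field.

```python
# Sketch: sort by one column while keeping only (key, offset) pairs in RAM.
# Assumes a hypothetical line-based format with a tab-separated key field.

def sort_by_column(in_path, out_path):
    keys = []  # list of (key, byte_offset) pairs: the only data held in RAM
    with open(in_path, "rb") as f:
        offset = f.tell()
        # iter(f.readline, b"") avoids read-ahead buffering, so tell() stays
        # accurate at each line boundary.
        for line in iter(f.readline, b""):
            key = line.split(b"\t", 1)[0]
            keys.append((key, offset))
            offset = f.tell()

    keys.sort()  # only keys and offsets are sorted in memory

    # 'Restoration': seek to each entry and stream it to disk in sorted
    # order, so no additional RAM is needed for the entries themselves.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for _, off in keys:
            src.seek(off)
            dst.write(src.readline())
```

Note that the restoration phase performs one random seek per entry, which can be slow on spinning disks; for very large files the external mergesort suggested in the answer below avoids that.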

PS: I am looking for a solution that will work in both Lazarus and Delphi, because Lazarus (actually FPC) has 64-bit support for the Mac. 64 bits means more RAM available = faster sorting.

Answer

I think the way to go is Mergesort; it's a great algorithm for sorting a large number of fixed-size records with limited memory.

General idea:


  • read N lines from the input file (a value that allows you to keep the lines in memory)
  • sort these lines and write the sorted lines to file 1
  • repeat with the next N lines to obtain file 2

...

you reach the end of the input file, and you now have M files (each of which is sorted). Then you only have to merge those M sorted files into a single output file, reading one line at a time from each.
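The steps above, together with the final merge of the M sorted runs, can be sketched as follows. Again, Python is used only to illustrate the algorithm (the question targets Delphi/FPC), and the record format is assumed to be one line per record; `n_lines` plays the role of N from the steps above.

```python
import heapq
import os
import tempfile

def external_sort(in_path, out_path, n_lines=100_000):
    # Phase 1: read N lines at a time, sort each chunk in memory, and
    # write it to its own temporary run file (files 1..M).
    run_paths = []
    with open(in_path, "r", encoding="utf-8") as src:
        while True:
            chunk = [line for _, line in zip(range(n_lines), src)]
            if not chunk:
                break  # reached the end of the input file
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as run:
                run.writelines(chunk)
            run_paths.append(path)

    # Phase 2: k-way merge of the M sorted runs. heapq.merge consumes
    # the runs lazily, so only one line per run is held in memory.
    runs = [open(p, "r", encoding="utf-8") for p in run_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.writelines(heapq.merge(*runs))
    finally:
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)
```

The same two-phase structure carries over directly to Delphi/FPC with `TFileStream` and a small priority queue for the merge; memory use is bounded by N lines in phase 1 and by M buffered lines in phase 2.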

You could also consider a solution based on an embedded database, e.g. Firebird Embedded: it works well with Delphi/Windows, and you only have to add some DLLs to your program folder (I'm not sure about Lazarus/OSX).
