Sorting a large text file (5 million lines)


Question

I want to sort a text file that has a few million lines. Which algorithm should I use? (I am considering quicksort.)

My programming language is C#.

The text file is structured as follows:

Input          Desired result
-----          --------------
723,80         1,4
14,50          1,5
723,2          10,8
1,5            14,50
10,8           723,2
1,4            723,80







Memory usage is also very important to me.

Is this algorithm suitable for the job?

If so, please explain the algorithm and give an example.

Thank you.

Recommended answer

If memory is the critical factor, then look at external sorting algorithms (http://en.wikipedia.org/wiki/External_sorting), but be aware that they will be slower than memory-intensive ones.


Hi Friend,

You could do a kind of merge sort along these lines:

1. Create two buffers; their size is up to you.

2. Open a file handle on the input file. You will generate a fair amount of temporary data on disk, which is what keeps the approach light on RAM (physical memory).

3. Load the first chunk into buffer one and start a new thread to sort it. Meanwhile, load the next chunk into buffer two. When the thread finishes, it writes the sorted chunk to an external temporary file (number and name the files uniquely).

4. Then process the chunk in buffer two the same way, and while that happens load the next chunk into buffer one. This gives you a double-buffering system.

5. Once every chunk has been split off and sorted, start the merge phase: read from the temporary files you created, re-sort them if needed, and merge them into a new file. In the end the complete file is sorted.

6. Then overwrite the original file with the new file and clean up all the temporary files.

7. Finally, close the file handles, stop the threads, and tell the user that sorting completed successfully.

8. For now this is the only sensible method I can see, though I will keep thinking about whether something better is possible.
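
The split-and-merge steps above could be sketched in C# roughly as follows. This is only a minimal illustration under some assumptions of mine: the names `ExternalSort`, `SortFile` and `maxLinesPerChunk` are made up, the double buffering and the sorting thread are left out to keep it short, and the merge picks the smallest line with a plain linear scan over the runs, which is fine only while the number of temporary files stays small.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ExternalSort
{
    // Sorts inputPath into outputPath while holding at most
    // maxLinesPerChunk lines (a hypothetical tuning knob) in memory.
    public static void SortFile(string inputPath, string outputPath,
                                int maxLinesPerChunk, IComparer<string> comparer)
    {
        List<string> runs = SplitIntoSortedRuns(inputPath, maxLinesPerChunk, comparer);
        try { MergeRuns(runs, outputPath, comparer); }
        finally { foreach (string run in runs) File.Delete(run); } // clean up temp files
    }

    // Phase 1: read the input chunk by chunk, sort each chunk, write it to a temp file.
    static List<string> SplitIntoSortedRuns(string inputPath, int maxLinesPerChunk,
                                            IComparer<string> comparer)
    {
        var runs = new List<string>();
        var chunk = new List<string>(maxLinesPerChunk);
        foreach (string line in File.ReadLines(inputPath)) // streams, does not load the file
        {
            chunk.Add(line);
            if (chunk.Count == maxLinesPerChunk) runs.Add(WriteRun(chunk, comparer));
        }
        if (chunk.Count > 0) runs.Add(WriteRun(chunk, comparer));
        return runs;
    }

    static string WriteRun(List<string> chunk, IComparer<string> comparer)
    {
        chunk.Sort(comparer);
        string path = Path.GetTempFileName(); // uniquely named temporary file
        File.WriteAllLines(path, chunk);
        chunk.Clear();
        return path;
    }

    // Phase 2: k-way merge. Only one pending line per run is kept in memory.
    static void MergeRuns(List<string> runs, string outputPath, IComparer<string> comparer)
    {
        var readers = new List<StreamReader>();
        var heads = new List<string>(); // current line of each run, null once exhausted
        try
        {
            foreach (string run in runs)
            {
                var reader = new StreamReader(run);
                readers.Add(reader);
                heads.Add(reader.ReadLine());
            }
            using (var writer = new StreamWriter(outputPath))
            {
                while (true)
                {
                    int min = -1; // index of the run holding the smallest pending line
                    for (int i = 0; i < heads.Count; i++)
                        if (heads[i] != null &&
                            (min < 0 || comparer.Compare(heads[i], heads[min]) < 0))
                            min = i;
                    if (min < 0) break; // every run is exhausted
                    writer.WriteLine(heads[min]);
                    heads[min] = readers[min].ReadLine();
                }
            }
        }
        finally
        {
            foreach (var reader in readers) reader.Dispose();
        }
    }
}
```

A call such as `ExternalSort.SortFile("input.txt", "sorted.txt", 500000, StringComparer.Ordinal)` would use about ten temporary files for 5 million lines (the file names here are placeholders). Ordinal string comparison happens to reproduce the desired order of the sample in the question, but it is not a numeric order in general, so you may need to pass your own comparer.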

I hope this helps. You should also look at the link posted by OriginalGriff in Solution 1; it may describe a better way to do this.

With regards,
Tushar Srivastava


5 million strings, each less than 10 characters long (judging from your sample), are not a big deal for main memory (about 50 MB).

I suggest that you load the whole file into one large buffer, as contiguous null-terminated strings, together with an array of indexes to the start of every string (this costs 4 additional bytes per row). Then use quicksort on the array of indexes, without moving the strings themselves, and output the sorted file.
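
One way this could look in C# is sketched below. Treat it as an illustration under assumptions, not a finished implementation: the class and method names are mine, the file is assumed to be single-byte text with no whitespace around the values, and `Array.Sort` stands in for a hand-written quicksort (the .NET implementation is a quicksort-based hybrid). Strings are compared byte by byte, which matches the sample order in the question but is not a numeric order in general.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class IndexSort
{
    // Turns line breaks into 0 terminators in place and returns the start
    // offset of every non-empty line (one 4-byte int per row).
    public static int[] BuildIndex(byte[] buf)
    {
        var starts = new List<int>();
        bool atLineStart = true;
        for (int i = 0; i < buf.Length; i++)
        {
            if (buf[i] == (byte)'\n' || buf[i] == (byte)'\r')
            {
                buf[i] = 0;          // terminate the current string
                atLineStart = true;
            }
            else if (atLineStart)
            {
                starts.Add(i);
                atLineStart = false;
            }
        }
        return starts.ToArray();
    }

    // Compares two null-terminated strings inside buf, byte by byte.
    // Reading past the end of the buffer counts as a terminator.
    public static int Compare(byte[] buf, int a, int b)
    {
        while (true)
        {
            byte x = a < buf.Length ? buf[a] : (byte)0;
            byte y = b < buf.Length ? buf[b] : (byte)0;
            if (x != y) return x - y;
            if (x == 0) return 0;
            a++;
            b++;
        }
    }

    // Writes the strings in index order, one per line.
    public static void WriteSorted(byte[] buf, int[] index, Stream output)
    {
        foreach (int start in index)
        {
            int end = start;
            while (end < buf.Length && buf[end] != 0) end++;
            output.Write(buf, start, end - start);
            output.WriteByte((byte)'\n');
        }
    }

    public static void SortFile(string inputPath, string outputPath)
    {
        byte[] buf = File.ReadAllBytes(inputPath);        // whole file in one buffer
        int[] index = BuildIndex(buf);
        Array.Sort(index, (a, b) => Compare(buf, a, b));  // only the offsets move
        using (var output = new BufferedStream(File.Create(outputPath)))
            WriteSorted(buf, index, output);
    }
}
```

Only the `int[]` of offsets is rearranged during the sort; the text buffer is written once when the index is built and is read-only afterwards.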

It seems that your file contains numerical values, one per row. If this is always the case, consider loading the values into an array of floating-point numbers (single or double precision, i.e. just 4 or 8 bytes per row) and sorting the array in place. Take care with the format conversions when reading and writing the values.
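
A rough C# sketch of the numeric variant follows. The key assumption, which the question does not confirm, is that a line such as "723,80" is a single decimal number written with a comma as the decimal separator; if the comma actually separates two integer fields, this sketch does not apply. The names `NumericSort`, `LoadSorted` and `Format` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class NumericSort
{
    // Assumption: the comma is a decimal separator ("723,80" means 723.80).
    static readonly NumberFormatInfo CommaDecimal = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Parses every line to a double (8 bytes per row) and sorts in place.
    public static double[] LoadSorted(IEnumerable<string> lines)
    {
        var values = new List<double>();
        foreach (string line in lines)
            values.Add(double.Parse(line, NumberStyles.Float, CommaDecimal));
        double[] sorted = values.ToArray();
        Array.Sort(sorted);
        return sorted;
    }

    // Converts a value back to text using the same comma convention.
    public static string Format(double value)
    {
        return value.ToString(CommaDecimal);
    }

    public static void SortFile(string inputPath, string outputPath)
    {
        double[] sorted = LoadSorted(File.ReadLines(inputPath));
        using (var writer = new StreamWriter(outputPath))
            foreach (double value in sorted)
                writer.WriteLine(Format(value));
    }
}
```

This is where the format conversion matters: the original text is not kept, so trailing zeros are lost on output and "723,80" is written back as "723,8". If the output must match the input text exactly, sorting the original strings with a numeric key is the safer route.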

