Sorting Large File (Code/Performance)


Question

Hello all,

I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
to sort based on the first two characters.

I'd greatly appreciate it if someone could post sample code that can help
me do this.

Also, any ideas on approximately how long the sort process is going to
take (XP, Dual Core 2.0GHz w/2GB RAM)?

Cheers,

Ira

Recommended answer

Ir*******@gmail.com wrote:

> I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
> to sort based on the first two characters.
>
> I'd greatly appreciate it if someone could post sample code that can help
> me do this.



Use the Unix sort command:

sort inputfile -o outputfile

I think there is a Cygwin port.


> Also, any ideas on approximately how long the sort process is going to
> take (XP, Dual Core 2.0GHz w/2GB RAM)?



Eh, Unix sort would probably take a while, somewhere between 15
minutes and an hour. If you only have to do it once, it's not worth
writing special-purpose code. If you have to do it a lot, get some
more RAM for that box, suck the file into memory and do a radix sort.
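For readers who cannot add RAM, here is a minimal sketch of a related but different technique: an on-disk bucket (distribution) pass rather than the in-memory radix sort suggested above. It assumes the sort key is simply the first two bytes of each line and that byte order is an acceptable sort order. The function name `sort_by_prefix` and the `flush_lines` parameter are illustrative, not from the thread.

```python
import os
import shutil
import tempfile
from collections import defaultdict


def sort_by_prefix(src_path, dst_path, prefix_len=2, flush_lines=100_000):
    """Stable sort of lines by their first prefix_len bytes using bucket files.

    Hypothetical helper: lines are only grouped by key, so lines sharing a
    key keep their original relative order.
    """
    tmp_dir = tempfile.mkdtemp(prefix="buckets_")
    pending = defaultdict(list)  # key -> lines buffered in memory
    keys = set()                 # every key seen so far
    buffered = 0

    def bucket_path(key):
        # Hex-encode the key so the file name is valid on any file system.
        return os.path.join(tmp_dir, key.hex() + ".bucket")

    def flush():
        # Append the buffered lines to their bucket files, then free memory.
        for key, lines in pending.items():
            with open(bucket_path(key), "ab") as out:
                out.writelines(lines)
        pending.clear()

    # Pass 1: distribute lines into one bucket file per key.
    with open(src_path, "rb") as src:
        for line in src:
            if not line.endswith(b"\n"):
                line += b"\n"  # final line may lack a newline
            key = line.rstrip(b"\r\n")[:prefix_len]
            pending[key].append(line)
            keys.add(key)
            buffered += 1
            if buffered >= flush_lines:
                flush()
                buffered = 0
    flush()

    # Pass 2: concatenate the buckets in key order.
    with open(dst_path, "wb") as dst:
        for key in sorted(keys):
            path = bucket_path(key)
            with open(path, "rb") as bucket:
                shutil.copyfileobj(bucket, dst)
            os.remove(path)
    os.rmdir(tmp_dir)
```

Memory use is bounded by `flush_lines` rather than by the file size, at the cost of writing the data to disk twice. How long this takes on the hardware described in the thread has not been measured here.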


Ir*******@gmail.com wrote:

> Hello all,
>
> I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
> to sort based on the first two characters.



Given those numbers, the average number of characters per line is
less than 2. Please check.

John Nagle
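The arithmetic behind this check can be sketched in a couple of lines. The helper name `avg_bytes_per_line` is illustrative, and "~2GB" is taken as 2 * 1024**3 bytes, which is an assumption about what the poster meant.

```python
def avg_bytes_per_line(file_size_bytes, line_count):
    """Average bytes per line, including each line's newline byte."""
    return file_size_bytes / line_count


# "~2GB" interpreted as 2 GiB (an assumption), with the stated 1.6 billion lines.
size = 2 * 1024**3
print(avg_bytes_per_line(size, 1_600_000_000))
```

Since every line also needs at least one byte for its line terminator, the stated figures leave barely any room for content, let alone a two-character sort key.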


Thanks to all who replied. It's much appreciated.

Yes, I had to double-check the line counts, and the number of lines is ~16
million (instead of the stated 1.6B).

Also:

> What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:



The file is UTF-8.


> Do the first two characters always belong to the ASCII subset?



Yes, the first two always belong to the ASCII subset.


> What are you going to do with it after it's sorted?



I need to isolate all lines that start with two particular characters (zz, to be
specific).
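If isolating the zz lines is the only goal, a single streaming pass may be enough and no sort is needed. The following is a minimal sketch under the assumptions stated in the thread (UTF-8 file, ASCII first two characters). The function name `extract_lines_with_prefix` is illustrative.

```python
def extract_lines_with_prefix(src_path, dst_path, prefix=b"zz"):
    """Copy every line that starts with prefix to dst_path; return the count.

    Reads in binary mode: UTF-8 is ASCII-compatible, so an ASCII prefix can
    be compared against raw bytes without decoding each line.
    Caveat: if the file begins with a UTF-8 byte order mark, the first line
    will not match even if its text starts with the prefix.
    """
    matched = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # one line in memory at a time
            if line.startswith(prefix):
                dst.write(line)
                matched += 1
    return matched
```

Because only one line is held in memory at a time, the 2GB RAM limit should not matter for this approach; the run time would be dominated by reading the file once from disk.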


> Here's a start: http://docs.python.org/lib/typesseq-mutable.html
> Google "GnuWin32" and see if their sort does what you want.



Will do, thanks for the tip.


> If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.



I am limited in resources, unfortunately.

Cheers,

Ira

