Sorting large text data
Problem description
I have a large file (100 million lines of tab-separated values - about 1.5GB in size). What is the fastest known way to sort this based on one of the fields?
I have tried Hive. I would like to see if this can be done faster using Python.
Answer
Have you considered using the *nix sort program? In raw terms, it will probably be faster than most Python scripts.
Use -t $'\t' to specify that the input is tab-separated, -k n to specify the sort field (where n is the field number), and -o outputfile if you want to write the result to a new file.

Example:
sort -t $'\t' -k 4 -o sorted.txt input.txt
This will sort input.txt on its 4th field and write the result to sorted.txt.