Sorting large text data

Question

I have a large file (100 million lines of tab separated values - about 1.5GB in size). What is the fastest known way to sort this based on one of the fields?

I have tried Hive. I would like to see if this can be done faster using Python.

Recommended answer

Have you considered using the *nix sort program? In terms of raw speed, it will probably be faster than most Python scripts.

Use -t $'\t' to specify that it's tab-separated, -k n to specify the field, where n is the field number, and -o outputfile if you want to output the result to a new file. Example:

sort -t $'\t' -k 4 -o sorted.txt input.txt

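This sorts input.txt on its 4th field and writes the result to sorted.txt. Note that -k 4 uses a key running from field 4 to the end of the line; to compare field 4 alone, write -k 4,4.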
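
Since the question asks about Python, here is a minimal sketch of how the same job could be done there with an external merge sort: sort fixed-size chunks in memory, write each to a temporary file, then merge them with heapq.merge. This is not the answer's method and will most likely be slower than sort; the names external_sort, sort_key, and chunk_lines are made up for illustration. It assumes UTF-8 input, compares the chosen field as a plain string, and uses a 0-based field index (so the 4th field is index 3).

```python
import heapq
import os
import tempfile
from itertools import islice


def sort_key(line, field_index):
    # Hypothetical helper: return the chosen tab-separated field (0-based).
    return line.rstrip("\n").split("\t")[field_index]


def external_sort(input_path, output_path, field_index, chunk_lines=1_000_000):
    """Sort a tab-separated file by one field without holding it all in memory."""
    key = lambda line: sort_key(line, field_index)
    chunk_paths = []
    try:
        # Pass 1: read chunks of lines, sort each in memory, spill to temp files.
        with open(input_path, "r", encoding="utf-8") as src:
            while True:
                chunk = list(islice(src, chunk_lines))
                if not chunk:
                    break
                # A final line without a newline would fuse with the next
                # line during the merge, so terminate it explicitly.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort(key=key)
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Pass 2: lazily merge the sorted chunk files into the output.
        files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
        try:
            with open(output_path, "w", encoding="utf-8") as dst:
                dst.writelines(heapq.merge(*files, key=key))
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary chunk files even if something failed.
        for p in chunk_paths:
            os.remove(p)
```

For the example above, the call would be external_sort("input.txt", "sorted.txt", 3). With 100 million lines and the default chunk size this produces about 100 temporary files, all open at once during the merge, so chunk_lines may need tuning against available memory and the open-file limit.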
