Why is using a pipe for sort (Linux command) slow?

Question

I have a large text file of ~8GB on which I need to do some simple filtering and then sort all the rows. I am on a 28-core machine with an SSD and 128GB RAM. I have tried

Method 1

awk '...' myBigFile | sort --parallel=56 > myBigFile.sorted

Method 2

awk '...' myBigFile > myBigFile.tmp
sort --parallel 56 myBigFile.tmp > myBigFile.sorted

Surprisingly, method 1 takes 11.5 min, while method 2 takes only (0.75 + 1 < 2) min. Why is sorting so slow when piped? Is it not parallelized?

Edit

awk and myBigFile are not important: the experiment can be reproduced with just seq 1 10000000 | sort --parallel 56 (thanks to @Sergei Kurenkov), and I also observed a six-fold speedup with the un-piped version on my machine.

Recommended answer

When reading from a pipe, sort assumes that the input is small, and for small inputs parallelism isn't helpful. To get sort to use parallelism, you need to tell it to allocate a large main-memory buffer with -S. In this case the data file is about 8GB, so you can use -S8G. However, at least on your system with 128GB of main memory, method 2 may still be faster.
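As a rough illustration of the -S suggestion, the sketch below runs the piped seq experiment twice, once with sort's default buffer and once with an explicit buffer size. The values of N and BUF and the file names are assumptions chosen to keep the run small; for the 8GB case in the question the answer's -S8G would be the relevant setting. It only checks that both runs give the same output and does not claim any particular speedup.

```shell
#!/bin/sh
# Sketch: piped sort with the default buffer vs. an explicit -S buffer.
# N, BUF and the file names are illustrative, not from the original post.
N=${N:-200000}
BUF=${BUF:-64M}              # the answer suggests -S8G for the 8GB file
workdir=$(mktemp -d)

# Default buffer: sort cannot learn the input size from the pipe.
seq 1 "$N" | sort --parallel=4 > "$workdir/default.sorted"

# Explicit buffer size passed with -S.
seq 1 "$N" | sort -S "$BUF" --parallel=4 > "$workdir/buffered.sorted"

# Both runs should produce identical output; only the timing may differ.
if cmp -s "$workdir/default.sorted" "$workdir/buffered.sorted"; then
    echo "outputs identical"
fi
```

To compare timings on a real workload, each pipeline can be prefixed with time and N raised to match the data size. The temporary directory can be removed afterwards with rm -r "$workdir".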

This is because sort in method 2 can tell from the size of the file that it is huge, and it can seek in the file (neither of which is possible for a pipe). Further, since you have so much memory compared to these file sizes, the data for myBigFile.tmp need not be written to disk before awk exits, and sort will be able to read the file from cache rather than from disk. So the principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that sort in method 2 knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in method 1 sort has to discover that the data is huge, and it cannot use any parallelism in reading the input since it can't seek in the pipe.
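The distinction between the two kinds of input can be seen from the shell. The sketch below is only an illustration and not how sort itself decides: describe_stdin is a hypothetical helper that reports whether its standard input is a regular file (whose size is known up front and which supports seeking) or a pipe (which offers neither). It assumes a Linux-like system where /dev/stdin refers to the current standard input.

```shell
#!/bin/sh
# Hypothetical helper: report what kind of object standard input is.
# Assumes /dev/stdin resolves to the current stdin (true on Linux).
describe_stdin() {
    if [ -f /dev/stdin ]; then
        echo "regular file"   # size known in advance, seekable
    elif [ -p /dev/stdin ]; then
        echo "pipe"           # size unknown, not seekable
    else
        echo "other"
    fi
}

tmpfile=$(mktemp)
seq 1 10 > "$tmpfile"

describe_stdin < "$tmpfile"   # input redirected from a file, as in method 2
seq 1 10 | describe_stdin     # input arriving through a pipe, as in method 1
```

On the regular file a program can query the size before reading anything, which is the information the answer says sort has in method 2 and lacks in method 1.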
