Ubuntu terminal - using GNU Parallel to read lines in all files in a folder


Question

I am trying to count the lines in all the files in a very large folder under Ubuntu.

The files are .gz files and I use

zcat * | wc -l

to count all the lines in all the files, and it's slow!

I want to use multiple cores for this task and came across GNU Parallel.

I tried to use this bash command:

parallel zcat * | parallel --pipe wc -l

but not all of the cores are working. I suspected that job startup might cause major overhead, so I tried batching with

parallel -X zcat * | parallel --pipe -X wc -l

without improvement.

How can I use all the cores to count the lines in all the files in a folder, given that they are all .gz files and need to be decompressed before counting the rows? (I don't need to keep them uncompressed afterwards.)

Thanks!

Answer

If you have 150,000 files, you will likely get problems with "argument list too long". You can avoid that like this:

find . -maxdepth 1 -name '*.gz' -print0 | parallel -0 ...

If you want the name beside the line count, you will have to echo it yourself, since your wc process will only be reading from its stdin and won't know the filename:

find ... | parallel -0 'echo {} $(zcat {} | wc -l)'

Next comes efficiency, which will depend on what your disks are capable of. Try parallel -j2, then parallel -j4, and see what works best on your system.


As Ole helpfully points out in the comments, you can avoid having to output the name of the file whose lines are being counted by using GNU Parallel's --tag option to tag each output line with its argument, so this is even more efficient:

find ... | parallel -0 --tag 'zcat {} | wc -l'
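
The commands above print one count per file, while the question asks for a single total. A minimal sketch of one way to add the counts up is shown here. The function name total_gz_lines, its DIR and JOBS arguments, the default of 2 jobs, and the xargs fallback are all assumptions made for illustration and are not part of the original answer; gzip -dc is used in place of zcat because the two behave the same on .gz files.

```shell
# total_gz_lines DIR [JOBS]
# Hypothetical helper: prints the total number of lines across all .gz files
# directly inside DIR, decompressing JOBS files at a time (default 2).
total_gz_lines() {
    local dir=${1:-.} jobs=${2:-2}
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # GNU Parallel: one "gzip -dc | wc -l" job per file.
        find "$dir" -maxdepth 1 -name '*.gz' -print0 |
            parallel -0 -j "$jobs" 'gzip -dc {} | wc -l'
    else
        # Fallback (assumption): xargs -P when GNU Parallel is not installed.
        find "$dir" -maxdepth 1 -name '*.gz' -print0 |
            xargs -0 -P "$jobs" -I{} sh -c 'gzip -dc "$1" | wc -l' _ {}
    fi | awk '{ total += $1 } END { print total + 0 }'   # sum per-file counts
}
```

For example, total_gz_lines . 4 would count the files in the current directory with four jobs at a time. Because each job prints only a number, the order in which jobs finish does not affect the sum.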
