Ubuntu terminal - using GNU Parallel to read lines in all files in folder
Problem description
I am trying to count the lines in all the files in a very large folder under Ubuntu.
The files are .gz files and I use
zcat * | wc -l
to count all the lines in all the files, and it's slow!
I want to use multiple cores for this task and found GNU Parallel.
I tried this bash command:
parallel zcat * | parallel --pipe wc -l
but not all the cores are busy. I read that starting jobs might cause major overhead, so I tried batching with
parallel -X zcat * | parallel --pipe -X wc -l
without improvement.
How can I use all the cores to count the lines in all the files in a folder, given that they are all .gz files and need to be decompressed before counting the rows? (I don't need to keep them uncompressed afterwards.)
Thanks!
If you have 150,000 files, you will likely get problems with "argument list too long". You can avoid that like this:
find . -maxdepth 1 -name '*.gz' -print0 | parallel -0 ...
If you want the name beside the line count, you will have to echo it yourself, since your wc process will only be reading from its stdin and won't know the filename:
find ... | parallel -0 'echo {} $(zcat {} | wc -l)'
Next, we come to efficiency, which will depend on what your disks are capable of. Maybe try with parallel -j2, then parallel -j4, and see what works on your system.
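One way to compare settings is to time the same job at a few values of -j. The sketch below is only an illustration: the function name time_jobs is made up here, it assumes GNU Parallel is installed, and it measures whole seconds with date +%s, which is coarse for small folders.

```shell
# Hypothetical helper: time the line count of every .gz file in a folder
# for one value of -j. Assumes GNU Parallel is on the PATH.
time_jobs() {
    dir=$1
    njobs=$2
    start=$(date +%s)
    # -print0 / -0 keep unusual file names intact; the counts are discarded
    # because only the elapsed time matters here.
    find "$dir" -maxdepth 1 -name '*.gz' -print0 |
        parallel -0 -j "$njobs" 'zcat {} | wc -l' > /dev/null
    end=$(date +%s)
    echo "jobs=$njobs seconds=$((end - start))"
}

# Example: try a few job counts on the current folder.
# for j in 1 2 4 8; do time_jobs . "$j"; done
```

Keep in mind that repeated runs may be served from the operating system's file cache, so the first run can look slower for reasons unrelated to -j.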
As Ole helpfully points out in the comments, you can avoid echoing the file name yourself by using GNU Parallel's --tag option, which prefixes each output line with the argument it came from, so this is even more efficient:
find ... | parallel -0 --tag 'zcat {} | wc -l'
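Since the original goal was a single total, the per-file counts can also be summed at the end. This is a minimal sketch, assuming GNU Parallel is installed; the function name count_gz_lines is made up for illustration, --tag is left out so that each output line is just a number, and awk is used only to add the numbers up.

```shell
# Hypothetical helper: print the total number of lines across all .gz files
# directly inside a folder. Assumes GNU Parallel is on the PATH.
count_gz_lines() {
    dir=$1
    # Each job decompresses one file and prints its line count;
    # awk adds the counts and prints 0 if there were no files.
    find "$dir" -maxdepth 1 -name '*.gz' -print0 |
        parallel -0 'zcat {} | wc -l' |
        awk '{ total += $1 } END { print total + 0 }'
}

# Example: count_gz_lines .
```

A -j value can be added to the parallel call once you know what suits your disks.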