GNU parallel: assign one thread for each node (directories and sub-directories) of an entire tree from a start directory


Question


I would like to benefit from the full potential of the parallel command on macOS (it seems there exist two versions, GNU's and Ole Tange's, but I am not sure).

With the following command:

parallel -j8  find {} ::: *

I will get good performance if I am in a directory containing 8 subdirectories. But if all of these subdirectories have little content except one, only a single thread will end up working on the unique "big" directory.

  1. Is there a way to continue the parallelization into this "big directory"? I mean, can the one remaining thread be helped by the other threads (the ones that previously worked on the small subdirectories)?

    The ideal case would be that parallel "switches automatically" once all the small subdirectories have been traversed by the find commands in the command line above. Maybe I am asking too much?

  2. Another potential optimization, if it exists: considering a generic tree directory structure, is there a way, similar for example to the command make -j8, to assign each current thread to a sub-(sub-(sub-...)) directory, and once the current directory has been explored (don't forget, I would mostly like to use this optimization with the find command), have another thread explore another sub-(sub-(sub-...)) directory?

    Of course, the total number of running threads would be no greater than the number specified with the parallel command (parallel -j8 in my example above): if the number of tree elements (1 node = 1 directory) is greater than the number of threads, we cannot exceed that number.

    I know that parallelizing in a recursive context is tricky, but maybe I can gain a significant factor when I want to find a file in a big tree structure?

    That's why I took the example of make -j8: I don't know how it is coded, but it makes me think that we could do the same with the parallel/find command line at the beginning of my post.

Finally, I would like your advice on these two questions and, more generally, on what is and is not currently possible among these suggested optimizations, in order to find a file more quickly with the classical find command.

UPDATE 1: As @OleTange said, I don't know a priori the structure of the directories I want gupdatedb to index. So it is difficult to know the maxdepth in advance. Your solution is interesting, but the first execution of find is not multithreaded; it does not use the parallel command. I am a little surprised that a multithreaded version of gupdatedb does not exist: on paper it is feasible, but once you want to code it in the GNU gupdatedb script on macOS 10.15, it is more difficult.

If someone has other suggestions, I will take them!

Solution

If you are going to parallelize find you need to be sure that your disk can deliver data.

For magnetic drives you will rarely see a speedup; for RAID, network drives, and SSDs sometimes; and for NVMe, often.

The simplest way to parallelize find is to use */*:

parallel find ::: */*

Or */*/*:

parallel find ::: */*/*

These will search in sub-sub dirs and in sub-sub-sub dirs, respectively.

They will not search the top dirs, but that can be done by running a single additional find with the appropriate -maxdepth.
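For example (a sketch; the demo tree and the depth values are assumptions matching the `*/*/*` case), one extra single-process `find` covers the levels the parallel jobs skip:

```shell
#!/bin/sh
# Demo tree under a temp dir, purely illustrative
cd "$(mktemp -d)"
mkdir -p a/b/c
# The parallelized */*/* jobs only descend from depth 3, so one
# ordinary find covers the two levels above them
find . -mindepth 1 -maxdepth 2    # prints ./a and ./a/b, not ./a/b/c
```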

The above solution assumes you know something about the directory structure, so it is not a general solution.

I have never heard of a general solution. It would involve a breadth first search that would start some workers in parallel. I can see how it could be programmed, but I have never seen it.

If I were to implement it, it would be something like this (lightly tested):

#!/bin/bash
# Breadth-first traversal: each round lists the children of every
# entry found in the previous round, one find job per entry, run
# in parallel.

tmp=$(tempfile)
# List the immediate children of a single directory
myfind() {
  find "$1" -mindepth 1 -maxdepth 1
}
export -f myfind
# Seed with the children of the start directory
myfind . | tee "$tmp"
# Keep going while the previous round found anything
while [ -s "$tmp" ] ; do
    tmp2=$(tempfile)
    # One myfind job per entry; --lb keeps output line-buffered
    cat "$tmp" | parallel --lb myfind | tee "$tmp2"
    mv "$tmp2" "$tmp"
done
rm "$tmp"
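Where GNU parallel is not installed (for example a stock macOS), the same breadth-first idea can be sketched with `xargs -P`, which also runs jobs concurrently. This is a variant under my own assumptions (`-P8` mirroring the `-j8` in the question, directories only, a throwaway demo tree), not a drop-in replacement:

```shell
#!/bin/sh
# Breadth-first listing of all directories, parallelized per level.
# Assumptions: xargs supports -P (both BSD and GNU versions do);
# the demo tree is illustrative only.
cd "$(mktemp -d)"
mkdir -p x/1 x/2 y/3
level=$(find . -mindepth 1 -maxdepth 1 -type d)    # depth-1 dirs
while [ -n "$level" ]; do
    printf '%s\n' "$level" >> bfs.out
    # Children of every directory on this level, up to 8 finds at once
    level=$(printf '%s\n' "$level" \
        | xargs -P8 -I{} find {} -mindepth 1 -maxdepth 1 -type d)
done
cat bfs.out    # each directory appears once, level by level
```

Within one level the output order is nondeterministic (the finds race), but the level-by-level structure is preserved, which is the property a breadth-first search needs.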

(PS: I have reason to believe the parallel written by Ole Tange and GNU Parallel are one and the same).
