Bash script to process many files in parallel


Problem description

I have read similar questions on this topic, but none of them addresses the following problem:

I have a bash script that looks like this:

#!/bin/bash

for filename in /home/user/Desktop/emak/*.fa; do
    mkdir "${filename%.*}"
    cd "${filename%.*}"
    mkdir emak
    cd ..
done

This script basically does the following:

  • Iterates through all files in a directory
  • Creates a new directory named after each file
  • Enters the new directory and creates a directory called "emak" inside it

The real task is far more computationally expensive than creating the "emak" directory...

I have thousands of files to iterate through. Since each iteration is independent of the previous one, I would like to split the work across processors (I have 24 cores) so that I can process multiple files at the same time.

I have read some previous posts about running jobs in parallel (using GNU Parallel), but I do not see a clear way to apply it in this case.

Thanks

Solution

Something like this with GNU Parallel, where you create and export a bash function called doit. By default GNU Parallel runs one job per CPU core, so all 24 of your cores will be used:

#!/bin/bash

doit() {
    dir=${1%.*}
    mkdir "$dir"
    cd "$dir"
    mkdir emak
}
export -f doit
parallel doit ::: /home/user/Desktop/emak/*.fa
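If GNU Parallel is not installed, a similar effect can be achieved with `xargs -P`. This is a minimal sketch, assuming GNU xargs; a temporary directory with sample `.fa` files stands in for the question's real path:

```shell
#!/bin/bash
# Alternative sketch using xargs -P instead of GNU Parallel.
# A temporary directory with sample .fa files stands in for
# /home/user/Desktop/emak from the question.
workdir=$(mktemp -d)
touch "$workdir"/{a,b,c}.fa

doit() {
    dir=${1%.*}           # strip the .fa extension
    mkdir -p "$dir/emak"  # -p creates both directory levels in one call
}
export -f doit

# -0 pairs with printf '%s\0' to survive spaces in filenames;
# -P 24 runs up to 24 jobs at once, one filename (-n 1) per job.
printf '%s\0' "$workdir"/*.fa | xargs -0 -n 1 -P 24 bash -c 'doit "$1"' _
```

As with the GNU Parallel version, each job runs in its own shell, so the `cd`/`cd ..` bookkeeping of the original loop is unnecessary.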

You will really see the benefit of this approach if your "computationally expensive" part takes longer, or especially if its duration is variable. If a job takes, say, up to 10 seconds and is variable, GNU Parallel will submit the next job as soon as the shortest of the N parallel processes completes, rather than waiting for all N to finish before starting the next batch of N jobs.

As a crude benchmark, this takes 58 seconds:

#!/bin/bash

doit() {
    echo $1
    # Sleep up to 10 seconds
    sleep $((RANDOM*11/32768))
}
export -f doit
parallel -j 10 doit ::: {0..99}

and this directly comparable batched version takes 87 seconds, because each batch of N jobs must wait for its slowest member before the next batch can start:

#!/bin/bash
N=10
for i in {0..99}; do
    echo $i
    sleep $((RANDOM*11/32768)) &
    (( ++count % N == 0)) && wait
done
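The batching penalty can also be avoided in pure bash. The following sketch uses `wait -n` (which requires bash 4.3 or newer) so that, like GNU Parallel, a new job starts as soon as any running one finishes rather than after a whole batch:

```shell
#!/bin/bash
# Keep at most N jobs running; wait -n returns when any one job exits,
# so a new job starts immediately instead of waiting for a whole batch.
out=$(mktemp)
N=4
count=0
for i in {1..10}; do
    ( sleep 0.$((RANDOM % 3)); echo "$i" >> "$out" ) &
    (( ++count >= N )) && { wait -n; (( count-- )); }
done
wait            # collect the last few jobs
sort -n "$out"  # every job has run by the time wait returns
```

The job bodies and the small N here are placeholders for demonstration; the scheduling pattern is what matters.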


