Use more than one core in bash
Question
I have a Linux tool that (greatly simplifying) cuts out the sequences specified in an Illumina sequencing file. I have 32 files to grind through. One file is processed in about 5 hours. I have a CentOS server with 128 cores.
I've found a few solutions, but each one works in a way that only uses one core. The last one seems to fire off 32 nohups, but it still pushes the whole workload through one core.
My question is: does anyone have an idea how to use the server's potential? Basically every file can be processed independently; there are no relations between them.
This is the current version of the script, and I don't know why it only uses one core. I wrote it with the help of advice here on Stack and found on the Internet:
#!/bin/bash
FILES=/home/daw/raw/*
count=0
for f in $FILES
do
base=${f##*/}
echo "process $f file..."
nohup /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o "OUT$base" $f &
(( count ++ ))
if (( count = 31 )); then
wait
count=0
fi
done
To explain: FILES is a list of the files from the raw folder.
The "core" line that executes nohup: the first path is the path to the tool, the -a path is the path to the file with the patterns to cut, -o saves the output under the same name as the processed file with OUT prepended, and the last parameter is the input file to be processed.
The tool's README is here: https://github.com/vsbuffalo/scythe
Does anybody know how to handle this?
P.S. I also tried moving nohup before count, but it still uses only one core. I have no limits set on the server.
Answer
IMHO, the most likely solution is GNU Parallel, so you can run up to, say, 64 jobs in parallel with something like this:
parallel -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{.} {} ::: /home/daw/raw/*
This has the benefit that jobs are not batched: it keeps 64 running at all times, starting a new one as each job finishes. That is better than waiting potentially 4.9 hours for all 32 of your jobs to finish before starting the last one, which then takes a further 5 hours. Note that I chose 64 jobs arbitrarily here; if you don't specify otherwise, GNU Parallel will run one job per CPU core you have.
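If GNU Parallel happens not to be installed, `xargs -P` gives the same keep-N-jobs-running scheduling (without Parallel's `{.}` name mangling). A sketch under placeholder assumptions, with a dummy command and a /tmp directory standing in for scythe and the raw files:

```shell
# Keep-N-running scheduling with xargs -P: up to 2 jobs run at once, and a
# new one starts as each finishes. The sleep and /tmp path are placeholders.
mkdir -p /tmp/xargs_demo
rm -f /tmp/xargs_demo/*
printf '%s\n' one two three four five |
    xargs -P 2 -I{} sh -c 'sleep 0.1; touch /tmp/xargs_demo/{}'
ls /tmp/xargs_demo | wc -l
```

For the real workload you would feed the file list instead, e.g. `printf '%s\0' /home/daw/raw/* | xargs -0 -P 64 ...`, using `-0` so file names with spaces survive.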
Other useful parameters are:

- parallel --bar ... gives a progress bar
- parallel --dry-run ... does a dry run so you can see what it would do without actually doing anything
If you have multiple servers available, you can add them in a list and GNU Parallel will distribute the jobs amongst them too:
parallel -S server1,server2,server3 ...