GNU Parallel as job queue with named pipes
Question
I followed the sample code to create a GNU Parallel job queue as below:
# create a job queue file
touch jobqueue
# start the job queue
tail -f jobqueue | parallel -u php worker.php
# in another shell, add the data
while read LINE; do echo $LINE >> jobqueue; done < input_data_file.txt
This approach does work and handles the job as a simple job queue. But there are two problems:
1- reading data from the input file and then writing it to the jobqueue (another file) is slow, as it involves disk I/O.
2- if for some reason my job aborts in the middle and I restart the parallel processing, it will re-run all the jobs in the jobqueue file.
I can add a script in worker.php to actually remove the line from the jobqueue when the job is done, but I feel there is a better way to do this.
Is it possible that instead of using

tail -f jobqueue

I can use a named pipe as input to parallel, and my current setup can still work as a simple queue?
I guess that way I won't have to remove the lines from the pipe once they are done, as they will be removed automatically on read?
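For reference, a minimal sketch of the consume-on-read behavior of a named pipe (file names are hypothetical, and `cat` stands in for `parallel` just to show the FIFO semantics):

```shell
# Create a named pipe instead of a regular file
mkfifo jobqueue.fifo

# Stand-in for the consumer side (parallel would sit here);
# opening the FIFO for reading blocks until a writer connects
cat jobqueue.fifo > consumed.txt &

# Writer side: the line is handed straight to the reader, not stored on disk
echo "job-1" > jobqueue.fifo
wait

# consumed.txt now holds "job-1"; the FIFO itself retains nothing
rm jobqueue.fifo
```

Because nothing is retained, there is also nothing to re-run after a crash - but, as the answer below notes, there is also nowhere to buffer a backlog.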
P.S. I know and I have used RabbitMQ, ZeroMQ (and I love it), nng, nanomsg, and even PHP pcntl_fork as well as pthreads. So it is not a question of what is available for parallel processing. It is more a question of creating a working queue with GNU Parallel.
Answer
while read LINE; do echo $LINE >> jobqueue; done < input_data_file.txt
This can be done faster:
cat >> jobqueue < input_data_file.txt
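A quick check (with a hypothetical sample input standing in for input_data_file.txt) that the two commands produce the same queue contents for plain lines; note that the shell loop additionally word-splits the unquoted $LINE, so `cat` is both faster and safer:

```shell
# Hypothetical sample input
printf 'job a\njob b\njob c\n' > input_data_file.txt

# Fast version from the answer
: > queue_cat
cat >> queue_cat < input_data_file.txt

# Original per-line loop from the question
: > queue_loop
while read LINE; do echo $LINE >> queue_loop; done < input_data_file.txt

# For simple lines the results are byte-identical
cmp -s queue_cat queue_loop && echo "queues match"
```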
While a fifo may work, it will block. That means you cannot put a lot in the queue - which sort of defeats the purpose of a queue.
I am surprised if disk I/O is an issue for reading the actual jobs: GNU Parallel can run 100-1000 jobs per second. Jobs can at most be 128 KB, so at the very most your disk has to deliver 128 MB/s. If you are not running 100 jobs per second, then disk I/O of the queue will never be an issue.
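The worst-case figure above is simple arithmetic, using the numbers from the answer (up to 1000 jobs/s, at most 128 KB per job):

```shell
jobs_per_sec=1000
max_job_kb=128
# 1000 jobs/s * 128 KB/job = 128,000 KB/s = 128 MB/s
echo "$(( jobs_per_sec * max_job_kb / 1000 )) MB/s"
```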
You can use --resume --joblog mylog to skip jobs already run if you restart:
# Initialize queue
true >jobqueue
# (Re)start running the queue
tail -n+0 -f jobqueue | parallel --resume --joblog mylog