GNU Parallel as job queue with named pipes


Question

I followed the sample code to create a GNU Parallel job queue as below:

# create a job queue file
touch jobqueue

# start the job queue
tail -f jobqueue | parallel -u php worker.php

# in another shell, add the data
while read LINE; do echo "$LINE" >> jobqueue; done < input_data_file.txt

This approach does work and handles the jobs as a simple job queue. But there are two problems:

1- Reading data from the input file and then writing it to the jobqueue (another file) is slow, as it involves disk I/O.

2- If for some reason my job aborts in the middle and I restart the parallel processing, it will re-run all the jobs in the jobqueue file.

I can add a script in worker.php to remove a line from the jobqueue when that job is done, but I feel there is a better way to do this.

Is it possible that instead of using

tail -f jobqueue

I can use a named pipe as input to parallel, so that my current setup still works as a simple queue?

I guess that way I won't have to remove completed lines from the pipe, since they would be removed automatically on read?

P.S. I know and have used RabbitMQ, ZeroMQ (and I love it), nng, nanomsg, and even PHP pcntl_fork as well as pthreads. So this is not a question of what is out there for parallel processing. It is more a question of how to create a working queue with GNU Parallel.

Answer

while read LINE; do echo "$LINE" >> jobqueue; done < input_data_file.txt

This can be done faster:

cat >> jobqueue < input_data_file.txt 

While a fifo may work, it will block. That means you cannot put a lot in the queue - which sort of defeats the purpose of a queue.
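To make the blocking concrete, here is a minimal sketch of the fifo variant (the worker command is assumed from the question; the fifo path is made up). The writer only makes progress as fast as the reader drains the pipe, because a fifo's kernel buffer is small (typically 64 KB on Linux):

```shell
# A minimal sketch of using a named pipe as the queue.
rm -f /tmp/jobfifo
mkfifo /tmp/jobfifo

# Reader end; in the real setup this would be:
#   parallel -u php worker.php < /tmp/jobfifo
cat /tmp/jobfifo > /tmp/received &

# Writer end: blocks whenever the fifo's kernel buffer is full,
# so a large backlog stalls the producer instead of queueing up.
printf 'job1\njob2\njob3\n' > /tmp/jobfifo
wait
cat /tmp/received
rm -f /tmp/jobfifo /tmp/received
```

So the fifo gives you the "removed on read" behavior, but it cannot absorb a backlog the way a regular file can.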

I would be surprised if disk I/O is an issue for reading the actual jobs: GNU Parallel can run 100-1000 jobs per second. Jobs can be at most 128 KB, so at the very most your disk has to deliver 128 MB/s. If you are not running 100 jobs per second, then disk I/O of the queue will never be an issue.

You can use --resume --joblog mylog to skip jobs already run if you restart:

# Initialize queue
true >jobqueue
# (Re)start running the queue
tail -n+0 -f jobqueue | parallel --resume --joblog mylog php worker.php
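For intuition, --resume skips a job when the joblog already records it as completed. A rough pure-shell illustration of that idea (a hand-rolled done-log, not GNU Parallel's actual joblog format; all file names here are made up):

```shell
# Hypothetical sketch of resume-style skipping with a done-log.
printf 'job1\njob2\njob3\n' > jobqueue
printf 'job1\n' > donelog       # pretend job1 completed before a crash

while read -r job; do
    if grep -qx "$job" donelog; then
        echo "skipping $job (already done)"
    else
        echo "running $job"
        echo "$job" >> donelog  # record completion
    fi
done < jobqueue
# prints:
#   skipping job1 (already done)
#   running job2
#   running job3
rm -f jobqueue donelog
```

GNU Parallel does the bookkeeping for you: as long as you pass the same --joblog file on restart, completed jobs are not re-run.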

