What simple mechanism for synchronous Unix pooled processes?


Problem Description

I need to limit the number of processes being executed in parallel. For instance, I'd like to execute this pseudo-command line:

export POOL_PARALLELISM=4
for i in `seq 100` ; do
    pool foo -bar &
done

pool foo -bar # would not complete until the first 100 finished.

Therefore, despite 101 foos being queued up to run, only 4 would be running at any given time. pool would fork()/exit() and queue the remaining processes until they complete.

Is there a simple mechanism to do this with Unix tools? at and batch don't apply because they generally invoke at the top of the minute and execute jobs sequentially. Using a queue is not necessarily best because I want these synchronous.

Before I write a C wrapper employing semaphores and shared memory, and then debug the deadlocks I'll surely introduce, can anyone recommend a bash/shell or other tool mechanism to accomplish this?

Solution

There's definitely no need to write this tool yourself; there are several good choices.

make

make can do this pretty easily, but it relies extensively on files to drive the process. (If you want to run some operation on every input file that produces an output file, this might be awesome.) The -j command line option runs the specified number of tasks, and the -l load-average option specifies a system load average that must be met before new tasks are started. (Which might be nice if you want to do some work "in the background". Don't forget the nice(1) command, which can also help here.)

So, a quick (and untested) Makefile for image conversion; note the pattern rule, which is what lets make run one convert job per file:

ALL=$(patsubst cimg%.jpg,thumb_cimg%.jpg,$(wildcard *.jpg))

.PHONY: all
all: $(ALL)

# Pattern rule: build each thumbnail from its matching source image.
thumb_cimg%.jpg: cimg%.jpg
        convert $< -resize 100x100 $@

If you run this with make, it'll run the conversions one at a time. If you run it with make -j8, it'll run eight separate jobs. If you run make -j, it'll start hundreds. (When compiling source code, I find that twice the number of cores is an excellent starting point; it gives each processor something to do while waiting on disk IO requests. Different machines and different loads may behave differently.)
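For instance, combining the options above (the load ceiling of 4 is an arbitrary choice for illustration):

make -j8          # run up to eight jobs at once
make -j8 -l 4     # up to eight jobs, but hold off while load average is above 4
nice make -j      # as many jobs as make can start, at reduced scheduling priority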

xargs

xargs provides the --max-procs command line option (-P for short). This is best if the parallel work can be divided up along a single input stream, with items separated by either ASCII NUL or newlines. (Well, the -d option lets you pick something else, but those two are common and easy.) This gives you the benefit of find(1)'s powerful file-selection syntax rather than funny expressions like the Makefile example above, or lets your input be completely unrelated to files. (Consider a program for factoring large composite numbers into primes: making that task fit into make would be awkward at best, but xargs handles it easily; see the sketch after the next example.)

The earlier example might look something like this (using -printf '%f\0' so the leading ./ from find doesn't end up inside the thumb_ prefix):

find . -maxdepth 1 -name '*.jpg' -printf '%f\0' | xargs -0 --max-procs 16 -I {} convert {} -resize 100x100 thumb_{}
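And the factoring task mentioned above needs no files at all. A sketch, assuming a hypothetical numbers.txt with one integer per line; factor(1) is from GNU coreutils:

xargs --max-procs 4 --max-args 1 factor < numbers.txt   # four factor jobs at a time, one number each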

parallel

The moreutils package (available at least on Ubuntu) provides the parallel command. It can run in two different ways: either running a specified command on different arguments, or running different commands in parallel. The previous example could look like this:

parallel -i -j 16 convert {} -resize 100x100 thumb_{} -- *.jpg
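The other mode runs distinct commands in parallel; the commands themselves go after the --. A minimal sketch, with each quoted string run as its own job:

parallel -j 3 -- ls "df -h" "echo hi"   # three unrelated commands, run concurrently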

beanstalkd

The beanstalkd program takes a completely different approach: it provides a message bus you submit requests to, and job servers block on jobs being entered, execute the jobs, and then return to waiting for a new job on the queue. If you want to write data back to the specific HTTP request that initiated the job, this might not be very convenient, as you have to provide that mechanism yourself (perhaps a different 'tube' on the beanstalkd server); but if the end result is submitting data into a database, or email, or something similarly asynchronous, this might be the easiest to integrate into your existing application.
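For a sense of the moving parts, here's a minimal sketch of the wire protocol driven by hand with netcat, assuming a beanstalkd server on the default localhost:11300 ('hello' is the 5-byte job body; a real worker would also send delete once the job is done):

printf 'put 0 0 120 5\r\nhello\r\n' | nc -q 1 localhost 11300   # producer: enqueue one job (priority 0, no delay, 120s ttr)
printf 'reserve\r\n' | nc -q 1 localhost 11300                  # worker: block until a job is ready, then print it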

