Communication of parallel processes: what are my options?


Problem description

I'm trying to dig a bit deeper into parallelization of R routines.

What are my options with respect to the communication of a bunch of "worker" processes, regarding

  1. the communication between the respective workers?
  2. the communication of the workers with the "master" process?

AFAIU, there's no such thing as a "shared environment/shared memory" that both the master as well as all worker processes have access to, right?

The best idea I came up with so far is to base the communication on reading and writing JSON documents to the hard drive. That's probably a bad idea ;-) I chose .json over .Rdata files because JSON seems to be used for inter-software communication a lot, so I thought to go with that "standard".
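
For concreteness, such a disk-based handoff might look like the sketch below (assuming the jsonlite package; the file name and the polling loop are purely illustrative):

 library(jsonlite)

 # worker side: write a result where the master can pick it up
 result <- list(worker = 1, value = sum(1:10))
 write_json(result, "worker_1.json", auto_unbox = TRUE)

 # master side: wait for the file to appear, then read the result back
 while (!file.exists("worker_1.json")) Sys.sleep(0.1)
 res <- read_json("worker_1.json", simplifyVector = TRUE)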

Looking forward to learning about better options!

FYI: I'm usually parallelizing based on functions of the base package parallel and the contrib package snowfall, mainly relying on the function sfClusterApplyLB() to get the job done.
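
For the record, that basic pattern looks something like this (a minimal sketch; the cluster size and the toy function are arbitrary):

 library(snowfall)
 sfInit(parallel = TRUE, cpus = 2)              # spawn a socket cluster
 res <- sfClusterApplyLB(1:4, function(i) i^2)  # load-balanced apply over the workers
 sfStop()                                       # shut the cluster down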

EDIT

I should have stated that I'm running on Windows, but Linux-based answers/hints are also very much appreciated!

Solution

For communication between processes, a kind of fun place to start is the help page ?socketConnection and the code in the chunk marked "## Not run:". So start an R process and run

 con1 <- socketConnection(port = 6011, server = TRUE)

This process is acting as a server, listening on a particular port for some information. Now start a second R process and enter

 con2 <- socketConnection(Sys.info()["nodename"], port = 6011)

con2 in process 2 has made a socket connection with con1 on process 1. Back at con1, write out the R object LETTERS

writeLines(LETTERS, con1)

and retrieve them on con2.

readLines(con2)

So you've communicated between processes without writing to disk. Some important concepts are also implicit here, e.g., about blocking vs. non-blocking connections. Nor is this limited to communication on the same machine, provided the ports are accessible across whatever network the computers are on. This is the basis for makePSOCKcluster in the parallel package, with the addition that process 1 actually uses the system command and a script in the parallel package to start process 2. The object returned by makePSOCKcluster is subsettable, so that you can dedicate a fraction of your cluster to solving a particular task. In principle you could arrange for the spawned nodes to communicate with one another independent of the node that did the spawning.
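
That subsetting might look like the following sketch (the cluster size and the task are arbitrary):

 library(parallel)
 cl <- makePSOCKcluster(4)       # process 1 spawns 4 workers over sockets
 sub <- cl[1:2]                  # a cluster is a list of nodes, so it can be subset
 parSapply(sub, 1:2, function(i) Sys.getpid())  # run only on the first two workers
 stopCluster(cl)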

An interesting exercise is to do the same using the fork-like commands in the parallel package (on non-Windows). A high-level version of this is in the help page ?mcparallel, e.g.,

 p <- mcparallel(1:10)
 q <- mcparallel(1:20)
 # wait for both jobs to finish and collect all results
 res <- mccollect(list(p, q))

but this builds on top of the lower-level sendMaster and friends (peek at the mcparallel and mccollect source code).

The Rmpi package takes an approach like the PSOCK example, where the manager uses scripts to spawn workers, and with communication using MPI rather than sockets. But a different approach, worthy of a weekend project if you have a functioning MPI implementation, is to implement a script that does the same calculation on different data, and then collates the results onto a single node, using commands like mpi.comm.rank, mpi.barrier, mpi.send.Robj, and mpi.recv.Robj.
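
A skeleton of such an SPMD-style script might look as follows (a sketch only; it assumes the script is launched with something like mpirun -np 4 Rscript script.R, and that communicator 0 refers to MPI_COMM_WORLD):

 library(Rmpi)

 rank <- mpi.comm.rank(0)    # this process's rank
 size <- mpi.comm.size(0)    # total number of processes

 # each rank computes on its own slice of the data
 part <- sum(seq(rank + 1, 100, by = size))

 if (rank == 0) {
   # rank 0 collates the partial results from the other ranks
   total <- part
   for (src in seq_len(size - 1))
     total <- total + mpi.recv.Robj(source = src, tag = 0, comm = 0)
   cat("total:", total, "\n")
 } else {
   mpi.send.Robj(part, dest = 0, tag = 0, comm = 0)
 }

 mpi.barrier(comm = 0)
 mpi.quit()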

A fun weekend project would use the parallel package to implement a workflow that involves parallel computation but not of the mclapply variety, e.g., where one process harvests data from a web site and then passes it to another process that draws pretty pictures. The input to the first process might well be JSON, but the communication within R is probably much more appropriately done with R data objects.
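
One way to wire up such a two-process pipeline is to reuse the socket approach from above, but exchange R objects in binary form via serialize() and unserialize() rather than lines of text. A sketch (the port number and the stand-in data frame are arbitrary):

 # process 1 (the harvester) acts as the server:
 con <- socketConnection(port = 6011, server = TRUE, blocking = TRUE, open = "a+b")
 harvest <- data.frame(x = 1:10, y = rnorm(10))  # stand-in for scraped data
 serialize(harvest, con)                         # ship the object itself, not text
 close(con)

 # process 2 (the plotter) connects and receives the object:
 con <- socketConnection("localhost", port = 6011, blocking = TRUE, open = "a+b")
 harvest <- unserialize(con)
 close(con)
 plot(harvest$x, harvest$y)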
