Amazon AWS - Python for beginners

Problem description

I have a computationally intensive program doing calculations that I intend to parallelise. It is written in Python, and I hope to use the multiprocessing module. I would like some help understanding what I would need to do to have one program, run from my laptop, control the entire process.

I have two options in terms of what computers I can use. One is machines I can access from the terminal via ssh user@comp1.com (I'm not sure how to access them through Python) and then run instances on, although I'd like a more programmatic way to reach them than that. It seems that it would work if I ran a remote-manager type application?

The second option I was thinking of is using AWS EC2 servers (I think that is what I need). I found boto, which I have never used but which seems to provide an interface for controlling AWS. I feel I would then need something to actually distribute jobs on AWS, probably similarly to option 1 (?). I'm a bit in the dark here.
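For reference, a minimal sketch of launching worker instances with boto3 (the current successor to the boto library mentioned above); the AMI ID here is hypothetical, and AWS credentials are assumed to be configured locally:

import boto3

# Assumes credentials are already set up (e.g. in ~/.aws/credentials).
ec2 = boto3.resource("ec2")
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI with your code installed
    InstanceType="t2.micro",
    MinCount=4,                       # launch four worker instances
    MaxCount=4,
)
for inst in instances:
    inst.wait_until_running()
    inst.reload()                     # refresh attributes to pick up the public IP
    print(inst.id, inst.public_ip_address)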

To give you an idea of how parallelisable it is:

def run_pipeline(Parameters):
    # (Wrapped in a function so the final return is valid.)
    # First loop: each FunctionA(param) call is independent.
    res = []
    for param in Parameters:
        res.append(FunctionA(param))
    # Sequential step: combine all first-stage results.
    Parameters2 = FunctionB(res)
    # Second loop: each FunctionC(param) call is independent.
    res2 = []
    for param in Parameters2:
        res2.append(FunctionC(param))
    return res, res2

So the two loops are basically where I can send off many param values to be run in parallel, and I know how to recombine them to create res as long as I know which param they came from. Then I need to group them all together to get Parameters2, and the second part is again parallelisable.
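For the single-machine case, a minimal sketch of the same run_pipeline with the loops replaced by multiprocessing.Pool, assuming FunctionA, FunctionB and FunctionC are picklable functions defined at module top level:

from multiprocessing import Pool

def run_pipeline(Parameters):
    # Pool.map preserves input order, so res[i] lines up with Parameters[i].
    with Pool() as pool:                  # one worker process per core by default
        res = pool.map(FunctionA, Parameters)
        Parameters2 = FunctionB(res)      # sequential combining step
        res2 = pool.map(FunctionC, Parameters2)
    return res, res2

Because Pool.map returns results in input order, the "which param they came from" concern is handled automatically on one machine; across machines the index has to be carried explicitly, as in the sketches further below.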

Recommended answer

You would want to use the multiprocessing module only if you want the processes to share data in memory, and that is something I would recommend only if you absolutely must have shared memory for performance reasons. Python multiprocessing applications are non-trivial to write and debug.

If you are doing something like the distributed.net or seti@home projects, where the tasks are computationally intensive but reasonably isolated, you can follow this process:

  1. Create a master application that breaks the large task into smaller computation chunks (assuming the task can be broken down and the results then combined centrally).
  2. Create Python code that picks up tasks from a server (perhaps as files or some other one-time communication describing what to do), and run multiple copies of this Python process.
  3. These Python processes work independently of each other, process their data, and then return the results to the master process for collation.

You could run these processes on AWS single-core instances if you wanted, or use your laptop to run as many copies as you have cores to spare.
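For the laptop case, a minimal sketch of spawning one copy per core; worker.py is hypothetical and would contain the polling loop described below:

import os
import subprocess
import sys

# Launch one independent worker process per CPU core.
procs = [subprocess.Popen([sys.executable, "worker.py"])
         for _ in range(os.cpu_count())]
for proc in procs:
    proc.wait()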

Based on the updated question

So your master process will create files (or some other data structures) that hold the parameter info, one file for each param to process. These files will be stored in a shared folder called needed-work.

Each Python worker (on an AWS instance) will watch the needed-work shared folder, looking for available files to work on (or wait on a socket for the master process to assign a file to it).

The Python process that takes on a file that needs work will process it and store the result in a separate shared folder, with the parameter encoded as part of the file structure.

The master process will look at the files in the work-done folder, process them, and generate the combined response.
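A minimal sketch of this folder-based flow for the first stage, assuming a shared filesystem mounted at the same paths on every machine; FunctionA is the questioner's function, assumed to be importable by the workers:

import os
import pickle
import time

NEED = "needed-work"   # master writes one parameter file per param here
DONE = "work-done"     # workers write result files here under the same name

def master_submit(Parameters):
    # One file per parameter; the index in the name records which param it was.
    os.makedirs(NEED, exist_ok=True)
    os.makedirs(DONE, exist_ok=True)
    for i, param in enumerate(Parameters):
        with open(os.path.join(NEED, f"param-{i}.pkl"), "wb") as f:
            pickle.dump(param, f)

def worker_step():
    # Take any available file, process it, publish the result, mark it done.
    # Note: this naive pickup can race with other workers; see below.
    for name in os.listdir(NEED):
        path = os.path.join(NEED, name)
        with open(path, "rb") as f:
            param = pickle.load(f)
        result = FunctionA(param)           # the questioner's function
        with open(os.path.join(DONE, name), "wb") as f:
            pickle.dump(result, f)
        os.remove(path)
        return True
    return False

def master_collect(n):
    # Poll until all n results exist, then rebuild res in the original order.
    res = [None] * n
    while any(r is None for r in res):
        for name in os.listdir(DONE):
            i = int(name.split("-")[1].split(".")[0])
            with open(os.path.join(DONE, name), "rb") as f:
                res[i] = pickle.load(f)
        time.sleep(0.5)
    return res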

This whole solution could be implemented with sockets as well, where workers listen on a socket for the master to assign work to them, and the master waits on a socket for the workers to submit their responses.
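One way to get that socket plumbing without hand-writing a wire protocol is the standard library's multiprocessing.managers, which can expose queues over TCP. A minimal sketch, with a hypothetical port and authkey; FunctionA again stands in for the real work:

from multiprocessing.managers import BaseManager
from queue import Queue

job_q, result_q = Queue(), Queue()

class QueueManager(BaseManager):
    pass

def run_queue_server():
    # Runs on the master machine: exposes the two queues on TCP port 5000.
    QueueManager.register("get_jobs", callable=lambda: job_q)
    QueueManager.register("get_results", callable=lambda: result_q)
    server = QueueManager(address=("", 5000), authkey=b"secret").get_server()
    server.serve_forever()

def run_worker(master_host):
    # Runs on each worker machine: pull (index, param), push back (index, result).
    QueueManager.register("get_jobs")
    QueueManager.register("get_results")
    mgr = QueueManager(address=(master_host, 5000), authkey=b"secret")
    mgr.connect()
    jobs, results = mgr.get_jobs(), mgr.get_results()
    while True:
        i, param = jobs.get()               # blocks until work is assigned
        results.put((i, FunctionA(param)))

The master connects the same way a worker does, puts enumerated (index, param) pairs on the job queue, and drains the result queue until every index has been answered.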

The file-based approach requires a way for the workers to make sure the work they pick up is not also taken by another worker. This could be solved by having a separate work folder for each worker, with the master process deciding when a worker needs more work.

Workers could delete the files they pick up from their work folder, and the master process could watch for a folder becoming empty and add more work files to it.
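A minimal sketch of that master-side refill loop, with hypothetical folder names; because only the master writes into a worker's folder and only that worker reads from it, no two workers can claim the same file:

import os
import shutil
import time

BACKLOG = "needed-work"                  # all not-yet-assigned parameter files
WORKER_DIRS = ["worker-0", "worker-1"]   # hypothetical: one folder per worker

def master_refill_loop():
    # Hand out one backlog file at a time to any worker whose folder is empty.
    while True:
        for wdir in WORKER_DIRS:
            if not os.listdir(wdir):             # worker deleted its last file
                pending = sorted(os.listdir(BACKLOG))
                if pending:
                    shutil.move(os.path.join(BACKLOG, pending[0]), wdir)
        time.sleep(1)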

Again, it is more elegant to do this using sockets, if you are comfortable with that.
