ipython notebook : how to parallelize external script


Problem description


I'm trying to use parallel computing from the IPython parallel library, but I have little knowledge about it, and I find the docs difficult to read for someone who knows nothing about parallel computing.

Funnily enough, all the tutorials I found just reuse the example from the docs, with the same explanation, which from my point of view is useless.

Basically, what I'd like to do is run a few scripts in the background so they execute at the same time. In bash it would be something like:

for my_file in $(cat list_file); do
    python pgm.py "$my_file" &
done

But the bash interpreter of the IPython notebook doesn't handle background mode.

It seems the solution is to use the parallel library from IPython.

I tried :

from IPython.parallel import Client
rc = Client()
rc.block = True
dview = rc[:2] # I take only 2 engines

But then I'm stuck. I don't know how to run the same script or program twice (or more) at the same time.

Thanks.

Solution

One year later, I eventually managed to get what I wanted.

1) Create a function that does what you want to run on the different CPUs. Here it just calls a script from bash with the ! IPython magic command. I guess it would also work with the call() function.

def my_func(my_file):
    !python pgm.py {my_file}

Don't forget the {} when using !

Note also that the path to my_file should be absolute, since the cluster engines run where you started the notebook (i.e., where you ran jupyter notebook or ipython notebook), which is not necessarily where your files are.
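
The pure-Python version the author alludes to might look like the following minimal sketch using the subprocess module (check_call is a stricter cousin of the call() mentioned above: it raises an exception if the script fails). It assumes, as above, that pgm.py takes the file path as its only argument:

import subprocess

def my_func(my_file):
    # Equivalent of the ! magic above, but in plain Python;
    # check_call raises CalledProcessError if pgm.py exits non-zero.
    subprocess.check_call(["python", "pgm.py", my_file])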

2) Start your IPython notebook cluster with the number of CPUs you want. Wait a couple of seconds, then execute the following cell:

from IPython import parallel
rc = parallel.Client()
view = rc.load_balanced_view()
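
If you want to confirm the engines actually registered before mapping (not part of the original recipe, just a quick sanity check), the client exposes their ids:

print(rc.ids)  # e.g. [0, 1, 2, 3] with a 4-engine cluster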

3) Get a list of the files you want to process:

files = list_of_files
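
For example, list_of_files could be built with glob; a sketch assuming the inputs live under a hypothetical /abs/path/to/data directory (absolute paths, as noted in step 1):

import glob
files = sorted(glob.glob("/abs/path/to/data/*.txt"))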

4) Asynchronously map your function over all your files using the view of the engines you just created (not sure of the wording).

r = view.map_async(my_func, files)

While it's running you can do something else in the notebook (it runs in the "background"!). You can also call r.wait_interactive(), which interactively reports the number of files processed, the time spent so far, and the number of files left. This will prevent you from running other cells (but you can interrupt it).

And if you have more files than engines, no worries: each file will be processed as soon as an engine finishes with its previous one.
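
Putting steps 1-4 together, a typical cell could look like the following sketch. Since my_func above returns nothing, get() mainly serves to surface any errors raised on the engines:

r = view.map_async(my_func, files)  # returns immediately
r.wait_interactive()                # optional: live progress, interruptible
results = r.get()                   # re-raises any error from the engines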

Hope this will help others!

This tutorial might be of some help:

http://nbviewer.ipython.org/github/minrk/IPython-parallel-tutorial/blob/master/Index.ipynb

Note also that I still have IPython 2.3.1; I don't know whether this has changed with Jupyter.

Edit: this still works with Jupyter; see here for differences and potential issues you may encounter.


Note that if you use external libraries in your function, you need to import them on the different engines with:

%px import numpy as np

or

%%px
import numpy as np
import pandas as pd

The same goes for variables and other functions; you need to push them into the engines' namespace:

rc[:].push(dict(
                foo=foo,
                bar=bar))
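
For concreteness, here is a small sketch with hypothetical definitions of foo and bar; nothing defined only in the notebook is visible on the engines until it is pushed:

bar = 10            # a constant the engines need

def foo(x):         # a helper that my_func might rely on
    return x * bar

rc[:].push(dict(foo=foo, bar=bar))  # now every engine sees foo and bar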

