Can I use functions imported from .py files in Dask/Distributed?


Question

I have a question about serialization and imports.


  • Should functions have their own imports, as I've seen done with PySpark?
  • Is the following just plain wrong? Does mod.py need to be a conda/pip package? mod.py was written to a shared filesystem.

In [1]: from distributed import Executor

In [2]: e = Executor('127.0.0.1:8786')

In [3]: e
Out[3]: <Executor: scheduler="127.0.0.1:8786" processes=2 cores=2>

In [4]: import socket

In [5]: e.run(socket.gethostname)
Out[5]: {'172.20.12.7:53405': 'n1015', '172.20.12.8:53779': 'n1016'}

In [6]: %%file mod.py
   ...: def hostname():
   ...:     return 'the hostname'
   ...: 
Overwriting mod.py

In [7]: import mod

In [8]: mod.hostname()
Out[8]: 'the hostname'

In [9]: e.run(mod.hostname)
distributed.utils - ERROR - No module named 'mod'


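Recommended answer

Quick answer

Upload your mod.py file to all of your workers. You can do this using whatever mechanism you used to set up dask.distributed, or you can use the upload_file method: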

e.upload_file('mod.py')

Alternatively, if your function is made in IPython, rather than being part of a module, it will be sent along without a problem.

This all has to do with how functions get serialized in Python. Functions from modules are serialized by their module name and function name:

In [1]: from math import sin

In [2]: import pickle

In [3]: pickle.dumps(sin)
Out[3]: b'\x80\x03cmath\nsin\nq\x00.'

So if the client machine wants to refer to the math.sin function, it sends this bytestring (which you'll notice has 'math' and 'sin' buried among the other bytes) to the worker machine. The worker looks at this bytestring and says, "OK great, the function I want is in such and such a module, let me go and find that in my local file system." If the module isn't present, the worker raises an error, much like the one you received above.
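The failure can be reproduced on a single machine with only the standard library. The sketch below is an illustration, not what Dask runs internally: the module name mod_demo and the temporary directory are hypothetical stand-ins for mod.py and the client's filesystem, and removing the directory from sys.path simulates a worker that does not have the file.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Write a throwaway module (hypothetical name, stands in for mod.py).
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "mod_demo.py"), "w") as f:
    f.write("def hostname():\n    return 'the hostname'\n")

# "Client" side: the module is importable, so pickling succeeds.
sys.path.insert(0, tmpdir)
importlib.invalidate_caches()
mod_demo = importlib.import_module("mod_demo")
payload = pickle.dumps(mod_demo.hostname)

# The payload carries only the module and function names, not the code.
contains_names = b"mod_demo" in payload and b"hostname" in payload

# "Worker" side: simulate a machine where the file is not available.
sys.path.remove(tmpdir)
del sys.modules["mod_demo"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)  # tries to import mod_demo by name
    error = None
except ModuleNotFoundError as exc:
    error = exc

print(contains_names)
print(error)
```

Unpickling has to import the module by name, so it can only succeed where that module is importable. This is why making mod.py available on every worker resolves the error.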

For dynamically created functions (functions that you make in IPython), it uses a completely different approach, bundling up all of the code. This approach generally works fine.
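To my knowledge, Dask relies on the cloudpickle library for this by-value serialization. The sketch below is not that implementation. It is a deliberately simplified standard-library illustration of what "bundling up the code" means, and the helper names dump_by_value and load_by_value are hypothetical. It ignores closures, default arguments, and global variables, all of which a real serializer has to handle.

```python
import builtins
import marshal
import types


def hostname():
    return 'the hostname'


def dump_by_value(func):
    """Serialize the function's compiled code, not a reference to it."""
    return marshal.dumps(func.__code__)


def load_by_value(payload, name):
    """Rebuild a function from shipped code; no module import is needed."""
    code = marshal.loads(payload)
    # Give the rebuilt function a fresh globals dict with builtins only.
    return types.FunctionType(code, {"__builtins__": builtins}, name)


payload = dump_by_value(hostname)
rebuilt = load_by_value(payload, "hostname")
print(rebuilt())
```

Because the payload contains the code itself, the receiving side does not need a mod.py on disk. The trade-off is that marshal output is specific to the Python version, which is one more reason the client and workers should run matching environments.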

Generally speaking, Dask assumes that the workers and the client all have the same software environment. Typically this is handled by whoever sets up your cluster, using some other tool like Docker. Methods like upload_file are there to fill in the gaps when you have files or scripts that get updated more frequently.
