How to store worker-local variables in dask/distributed

Question

Using dask 0.15.0, distributed 1.17.1.

I want to memoize some things per worker, like a client to access Google Cloud Storage, because instantiating it is expensive. I'd rather store this in some kind of worker attribute. What is the canonical way to do this? Or are globals the way to go?

Recommended answer

On the worker

You can get access to the local worker with the get_worker function. A slightly cleaner thing than globals would be to attach state to the worker:

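from dask.distributed import get_worker

def my_function(...):
    worker = get_worker()            # the Worker object running this task
    worker.my_personal_state = ...   # attach whatever state you want to keep

# client is an existing dask.distributed.Client
future = client.submit(my_function, ...)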

We should probably add a generic namespace variable on workers to serve as a general place for information like this, but haven't yet.
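
For the expensive-client case from the question, attaching state to the worker can be turned into a small memoizing helper. This is only a sketch: get_or_create, make_expensive_client, my_task and the attribute name my_gcs_client are hypothetical names, not part of dask, and it assumes the worker object accepts arbitrary attributes as in the snippet above.

```python
def get_or_create(owner, attr_name, factory):
    """Return owner.<attr_name>, building it with factory() on first use."""
    if not hasattr(owner, attr_name):
        # Without a lock, two threads racing here may both call factory();
        # the attribute set last is the one later calls will see.
        setattr(owner, attr_name, factory())
    return getattr(owner, attr_name)


def make_expensive_client():
    # Hypothetical stand-in for an expensive constructor,
    # such as a cloud storage client.
    return object()


def my_task(x):
    # Imported here so the lookup happens on the worker that runs the task.
    from dask.distributed import get_worker

    worker = get_worker()
    client = get_or_create(worker, "my_gcs_client", make_expensive_client)
    # ... use client here ...
    return x
```

Submitting my_task several times to the same worker should then build the client only on the first call and reuse it afterwards.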

That being said, for things like connections to external services, globals aren't entirely evil. Many systems like Tornado use global singletons.

Note that workers are often multi-threaded. If your connection object isn't thread-safe then you may need to cache a different object per thread. For this I recommend using a threading.local object. Dask itself uses one, which you can import like this:

from distributed.worker import thread_state
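
A per-thread cache built on threading.local might look like the sketch below. The names _thread_cache, make_connection and get_connection are hypothetical, and the sketch assumes this code lives in a module that is importable on every worker, so the cache object is created on the worker rather than shipped from the client.

```python
import threading

# Each thread sees its own independent set of attributes on this object.
_thread_cache = threading.local()


def make_connection():
    # Hypothetical stand-in for a connection object that is not thread-safe.
    return object()


def get_connection():
    """Return the calling thread's connection, creating it on first use."""
    conn = getattr(_thread_cache, "connection", None)
    if conn is None:
        conn = make_connection()
        _thread_cache.connection = conn
    return conn
```

Tasks running on the same worker thread then share one connection, while tasks on other threads each get their own.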
