Is there an advantage to pre-scattering data objects in Dask?


Question

If I pre-scatter a data object across worker nodes, does it get copied in its entirety to each of the worker nodes? Is there an advantage in doing so if that data object is big?

Example with the futures interface:

client.scatter(data, broadcast=True)
results = dict()
for i in tqdm_notebook(range(replicates)):
    results[i] = client.submit(nn_train_func, data, **params)

Example with the delayed interface:

client.scatter(data, broadcast=True)
results = dict()
for i in tqdm_notebook(range(replicates)):
    results[i] = delayed(nn_train_func, data, **params)

The reason I ask is that I noticed the following phenomena:

  1. If I pre-scatter the data, delayed appears to re-send the data to the worker nodes, roughly doubling memory usage. Pre-scattering does not seem to do what I expected, which is to let the worker nodes reference the pre-scattered data.
  2. The futures interface takes a long time to iterate through the loop (significantly longer). I am currently not sure how to identify where the bottleneck is.
  3. With the delayed interface, there is a long delay between the time compute() is called and the time activity shows up on the dashboard, which I suspected was due to data copying.

Answer

Pre-scattering is designed to avoid placing large object data into the task graph.

x = np.array(lots_of_data)
a = client.submit(add, x, 1)  # have to send all of x to the scheduler
b = client.submit(add, x, 2)  # again
c = client.submit(add, x, 3)  # and again

You'll feel this pain because client.submit will be slow to return, and Dask may even throw a warning.

So instead we scatter our data, receiving a future in return:

x = np.array(lots_of_data)
x_future = client.scatter(x)
a = client.submit(add, x_future, 1)  # Only have to send the future/pointer
b = client.submit(add, x_future, 2)  # so this is fast
c = client.submit(add, x_future, 3)  # and this

In your situation you're almost doing this; the only difference is that you scatter your data, then forget about the future it returns, and send your data again.

client.scatter(data, broadcast=True)  # whoops!  forgot to capture the output
data = client.scatter(data, broadcast=True)  # data is now a future pointing to its remote value
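
For example, applying this fix to the loops from the question gives something like the following. This is a minimal sketch, not code from the original answer; it assumes nn_train_func, params, replicates, and client are defined as in your code, and it uses dask.delayed's wrap-then-call form so the future is passed as an argument rather than re-sending the raw data.

from dask import delayed

data = client.scatter(data, broadcast=True)  # capture the returned future

# futures interface: submit is handed the future, so only a pointer is sent
results = dict()
for i in range(replicates):
    results[i] = client.submit(nn_train_func, data, **params)

# delayed interface: wrap the function, then call the wrapper with the future
delayed_results = dict()
for i in range(replicates):
    delayed_results[i] = delayed(nn_train_func)(data, **params)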

You can choose to broadcast or not. If you know that all of your workers will need this data then it's not a bad thing to do, but things will be fine regardless.
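
As a rough illustration (a sketch, not part of the original answer) of what broadcast changes:

# without broadcast, the object is placed on one worker and is transferred
# to other workers on demand, as tasks that need it are scheduled there
x_future = client.scatter(x)

# with broadcast=True, the object is replicated to every worker up front,
# which helps when you know all workers will run tasks against it
x_future = client.scatter(x, broadcast=True)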
