dask: difference between client.persist and client.compute

Question

I am confused about the difference between client.persist() and client.compute(). Both seem (in some cases) to start my calculations, and both return asynchronous objects, but not in this simple example:

from dask.distributed import Client
from dask import delayed
client = Client()

def f(*args):
    return args

result = [delayed(f)(x) for x in range(1000)]

x1 = client.compute(result)
x2 = client.persist(result)

Here x1 and x2 are different, but in a less trivial calculation where result is also a list of Delayed objects, using client.persist(result) starts the calculation just like client.compute(result) does.

Recommended answer

The relevant doc page is here: http://distributed.readthedocs.io/en/latest/manage-computation.html#dask-collections-to-futures

As you say, both Client.compute and Client.persist take lazy Dask collections and start them running on the cluster. They differ in what they return.


  1. Client.persist returns a copy of each dask collection, with its previously lazy computations now submitted to run on the cluster. The task graphs of these collections now just point to the currently running Future objects.

So if you persist a dask dataframe with 100 partitions you get back a dask dataframe with 100 partitions, with each partition pointing to a future currently running on the cluster.

  2. Client.compute returns a single Future for each collection. This future refers to a single Python object result collected on one worker. This is typically used for small results.

So if you compute a dask.dataframe with 100 partitions you get back a Future pointing to a single Pandas dataframe that holds all of the data.

More pragmatically, I recommend using persist when your result is large and needs to be spread among many computers and using compute when your result is small and you want it on just one computer.

In practice I rarely use Client.compute, preferring instead to use persist for intermediate staging and dask.compute to pull down final results.

import dask.dataframe as dd

df = dd.read_csv('...')
df = df[df.name == 'alice']
df = df.persist()  # compute up to here, keep results in memory

>>> df.value.max().compute()
100

>>> df.value.min().compute()
0



When using delayed

Delayed objects only have one "partition" regardless, so compute and persist are more interchangeable. Persist will give you back a lazy dask.delayed object while compute will give you back an immediate Future object.
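For a list of Delayed objects like the one in the question, the same distinction can be sketched as follows. The function `inc`, the helper name `persist_vs_compute_on_delayed`, and the in-process `Client(processes=False, dashboard_address=None)` are assumptions made for illustration, not part of the original question.

```python
import dask
from dask import delayed
from dask.delayed import Delayed
from dask.distributed import Client, Future


def inc(x):
    return x + 1


def persist_vs_compute_on_delayed():
    """Hypothetical helper: compare both calls on a list of Delayed objects."""
    # In-process cluster so the sketch runs on a single machine (assumption).
    client = Client(processes=False, dashboard_address=None)
    try:
        lazy = [delayed(inc)(i) for i in range(3)]

        persisted = client.persist(lazy)  # list of Delayed, work already submitted
        futures = client.compute(lazy)    # list of Futures

        return {
            "persisted_all_delayed": all(isinstance(p, Delayed) for p in persisted),
            "futures_all_future": all(isinstance(f, Future) for f in futures),
            # Either handle leads to the same values once resolved.
            "persisted_values": list(dask.compute(*persisted)),
            "future_values": client.gather(futures),
        }
    finally:
        client.close()


if __name__ == "__main__":
    print(persist_vs_compute_on_delayed())
```

This is consistent with x1 and x2 looking different in the question: one is a list of Futures, the other a list of Delayed objects, even though both calls started the work on the cluster.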
