Write output directly from a dask worker


Problem description

I have a pipeline that transforms (maps) a dataframe. The output is large - rows in the input dataframe contain audio in binary format and rows in the output dataframe contain extracted binary features.

I'm reading the input from a partitioned parquet file and writing it back to a different parquet file(s) - both on a network share.

From my understanding, in distributed dask, each worker will send the output back to the scheduler (and then maybe the scheduler sends it back to the client??) and only then will the scheduler (or the client) write it to the network share. Is this correct?

If yes, if the data is big and bandwidth is an issue it seems there is redundant communication in this scenario - why can't the workers send the output directly to the final destination (network share in this case)? Certainly, the share needs to be available to all workers, and someone needs to synchronize the writes, but isn't this what the magic of dask is about?

Answer

Your understanding is incorrect: the workers will read and write to shared storage or cloud/network services directly, this is the normal way that things are calculated.

import dask.dataframe as dd

df = dd.read_parquet(url)
df_out = do_work(df)
df_out.to_parquet(url2)

In this case, the data is never seen by the scheduler or the client. They do communicate, though: the client will load metadata about the dataset, so that it can make inferences about how to split up the work to be done, and the scheduler talks to both the client and the workers to farm out these task specifications and check when they are done.

You can optionally use

local_df = df.compute()

but this is optional and obviously not recommended where the data size is bigger than memory. You usually never need to do this for the whole dataset, only maybe for some aggregate result much smaller than the original. Even in this case, the scheduler itself does not store the results.
