Does Dask communicate with HDFS to optimize for data locality?

Question

In the Dask distributed documentation, they have the following information:

For example Dask developers use this ability to build in data locality when we communicate to data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv') Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve load times for users.

However, it seems that get_block_locations() was removed from the HDFS fs backend, so my question is: what is the current state of Dask with regard to HDFS? Is it sending computation to the nodes where the data is local? Is the scheduler taking data locality on HDFS into account?
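
For reference, a minimal sketch of the high-level call being asked about, assuming a namenode reachable at namenode:8020 (the host, port and path are placeholders, not taken from the original question):

    import dask.dataframe as dd

    # Dask hands the "hdfs://" URL to its filesystem layer (fsspec/pyarrow),
    # which talks to the namenode and splits the matched files into partitions.
    df = dd.read_csv("hdfs://namenode:8020/path/to/files.*.csv")

    # Each partition becomes a task; whether those tasks land on the datanodes
    # holding the corresponding blocks is exactly what the question is about.
    print(df.head())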

Answer

Quite right: with the appearance of arrow's HDFS interface, which is now preferred over hdfs3, block locations are no longer considered in workloads accessing HDFS, since arrow's implementation doesn't include the get_block_locations() method.

However, we already wanted to remove the somewhat convoluted code which made this work, because we found that the inter-node bandwidth on test HDFS deployments was good enough that locality made little practical difference in most workloads. The extra constraints tying the size of the blocks to the size of the partitions you would like in memory added a further layer of complexity.
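
As a hedged sketch of that point: the size of each in-memory partition is chosen by the user through the blocksize argument of dask.dataframe.read_csv, independently of where HDFS happens to place its blocks (the path and the 128MB value are illustrative):

    import dask.dataframe as dd

    df = dd.read_csv(
        "hdfs://namenode:8020/path/to/files.*.csv",
        blocksize="128MB",  # target size of each in-memory partition
    )

    # One task per partition; the scheduler no longer tries to line these
    # partitions up with the physical HDFS blocks underneath them.
    print(df.npartitions)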

By removing the specialised code, we could avoid the very special case that was being made for HDFS as opposed to external cloud storage (s3, gcs, azure), where it didn't matter which worker accessed which part of the data.
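
If placement still matters for a particular deployment, it can be done by hand with the distributed client's workers= keyword; this is only an illustrative sketch (the scheduler and worker addresses are placeholders), not something Dask does automatically for HDFS any more:

    import fsspec
    import pandas as pd
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")

    def load_part(path):
        # Any worker can read any byte range through fsspec/pyarrow, so this
        # placement is purely an optional hint, not a requirement.
        with fsspec.open(path) as f:
            return pd.read_csv(f)

    future = client.submit(
        load_part,
        "hdfs://namenode:8020/path/to/files.0.csv",
        workers=["tcp://datanode-1:40000"],  # restrict execution to this worker
    )
    result = future.result()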

In short, yes, the docs should be updated.
