dask dataframe optimal partition size for 70GB data join operations


Question

I have a dask dataframe of around 70GB with 3 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM, running a local Dask cluster.

I have to take each of the 3 columns and join them to another, even larger dataframe.

The documentation recommends partition sizes of around 100MB. However, given this amount of data, joining 700 partitions seems to be a lot more work than, for example, joining 70 partitions of 1000MB each.
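The partition counts being compared follow directly from dividing the dataset size by the target partition size; a quick sketch (assuming 1GB = 1000MB for simplicity):

```python
# Rough partition-count arithmetic for a 70GB dataframe
# (illustrative sketch; assumes 1 GB = 1000 MB for simplicity).
total_mb = 70 * 1000  # 70GB dataset

for target_mb in (100, 1000):
    n_partitions = total_mb // target_mb
    print(f"{target_mb}MB partitions -> {n_partitions} partitions")
# 100MB partitions -> 700 partitions
# 1000MB partitions -> 70 partitions
```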

Is there a reason to keep it at 700 x 100MB partitions? If not, which partition size should be used here? Does this also depend on the number of workers I use?

  • 1 x 50GB worker
  • 2 x 25GB workers
  • 3 x 17GB workers
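The configurations above appear to split a fixed memory budget evenly among workers; a minimal sketch, assuming a budget of roughly 50GB out of the 64GB total (the rest left as headroom for the OS and scheduler), which reproduces the listed numbers:

```python
import math

# Assumed memory budget (GB) available to Dask workers out of 64GB total;
# the 50GB figure is inferred from the worker sizes in the question.
budget_gb = 50

for n_workers in (1, 2, 3):
    per_worker_gb = math.ceil(budget_gb / n_workers)
    print(f"{n_workers} x {per_worker_gb}GB worker(s)")
# 1 x 50GB, 2 x 25GB, 3 x 17GB
```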

Answer

Optimal partition size depends on many different things, including available RAM, the number of threads you're using, how large your dataset is, and, in many cases, the computation that you're doing.

For example, in your join/merge case it could be that your data is highly repetitive, so your 100MB partitions may quickly expand 100x into 10GB partitions and fill up memory. Or they might not; it depends on your data. On the other hand, join/merge code does produce n*log(n) tasks, so reducing the number of tasks (and thereby increasing partition size) can be highly advantageous.
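To put a rough number on the n*log(n) scaling, a back-of-the-envelope comparison of the two partition counts from the question (using a base-2 log; the log base only changes a constant factor, not the ratio):

```python
import math

def approx_tasks(n_partitions):
    # Shuffle-based join task count scales roughly as n * log(n);
    # this is a relative measure, not an exact Dask task count.
    return n_partitions * math.log2(n_partitions)

t700 = approx_tasks(700)
t70 = approx_tasks(70)
print(f"700 partitions: ~{t700:.0f} task units")
print(f"70 partitions:  ~{t70:.0f} task units")
print(f"ratio: ~{t700 / t70:.1f}x")
```

So going from 100MB to 1000MB partitions cuts the task count by well over the naive 10x, which is the advantage the answer refers to.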

Determining optimal partition size is challenging. Generally the best we can do is to provide insight about what is going on. That is available here:
