Strategy for partitioning dask dataframes efficiently

Question

The documentation for Dask talks about repartitioning to reduce overhead.

However, it seems to indicate that you need some knowledge of what your dataframe will look like beforehand (i.e. that there will be 1/100th of the expected data).

Is there a good way to repartition sensibly without making assumptions? At the moment I just repartition with npartitions = ncores * magic_number, and set force to True to expand partitions if need be. This one-size-fits-all approach works, but it is definitely suboptimal as my dataset varies in size.
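
For reference, a minimal sketch of that one-size-fits-all approach (the CSV loading step, the multiprocessing core count, and the magic_number value are illustrative assumptions, not part of the question):

    import multiprocessing
    import dask.dataframe as dd

    # Illustrative loading step; substitute your own data source.
    df = dd.read_csv("data/*.csv")

    # One-size-fits-all: a fixed multiple of the core count, regardless of data size.
    ncores = multiprocessing.cpu_count()
    magic_number = 4  # arbitrary tuning constant (assumption)
    df = df.repartition(npartitions=ncores * magic_number, force=True)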

The data is time series data, but unfortunately not at regular intervals. I've used repartitioning by time frequency in the past, but this is suboptimal because of how irregular the data is (sometimes nothing for minutes, then thousands of rows within seconds).
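
For completeness, frequency-based repartitioning on a sorted datetime index is a one-liner; the "1D" frequency below is only an example:

    # Assumes df has a sorted datetime index; "1D" gives one partition per day,
    # which produces very uneven partitions when the data is this irregular.
    df = df.repartition(freq="1D")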

Answer

After discussion with mrocklin, a decent strategy for partitioning is to aim for 100MB partition sizes, guided by df.memory_usage().sum().compute(). With datasets that fit in RAM, the additional work this might involve can be mitigated by placing df.persist() at relevant points.
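
A minimal sketch of that strategy, assuming df is an existing dask DataFrame (the exact 100MB target and the ceiling division are just one way to apply the guideline):

    import math

    # Total in-memory size of the data in bytes (this triggers a computation).
    total_bytes = df.memory_usage().sum().compute()

    # Aim for roughly 100MB per partition.
    target_bytes = 100e6
    npartitions = max(1, math.ceil(total_bytes / target_bytes))
    df = df.repartition(npartitions=npartitions)

    # If the dataset fits in RAM, persisting avoids redoing the work above
    # (and any earlier steps) on every subsequent computation.
    df = df.persist()

Depending on your dask version, repartition may also accept a partition_size argument that automates this calculation.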
