Strategy for partitioning dask dataframes efficiently

Question

The documentation for Dask talks about repartitioning to reduce overhead.

However, it seems to indicate that you need some knowledge of what your dataframe will look like beforehand (i.e. that there will be 1/100th of the expected data).

Is there a good way to repartition sensibly without making assumptions? At the moment I just repartition with npartitions = ncores * magic_number, and set force to True to expand partitions if need be. This one-size-fits-all approach works, but it is definitely suboptimal as my dataset varies in size.
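
For reference, a minimal sketch of that one-size-fits-all approach (the CSV loading step, the multiprocessing core count, and the magic_number value are illustrative assumptions, not part of the question):

    import multiprocessing
    import dask.dataframe as dd

    # Illustrative loading step; substitute your own data source.
    df = dd.read_csv("data/*.csv")

    # One-size-fits-all: a fixed multiple of the core count, regardless of data size.
    ncores = multiprocessing.cpu_count()
    magic_number = 4  # arbitrary tuning constant (assumption)
    df = df.repartition(npartitions=ncores * magic_number, force=True)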

The data is time series data, but unfortunately not at regular intervals. I've used repartitioning by time frequency in the past, but this is suboptimal because of how irregular the data is (sometimes nothing for minutes, then thousands of rows within seconds).
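
For completeness, frequency-based repartitioning on a sorted datetime index is a one-liner; the "1D" frequency below is only an example:

    # Assumes df has a sorted datetime index; "1D" gives one partition per day,
    # which produces very uneven partitions when the data is this irregular.
    df = df.repartition(freq="1D")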

Answer

After discussion with mrocklin, a decent strategy for partitioning is to aim for 100MB partition sizes, guided by df.memory_usage().sum().compute(). With datasets that fit in RAM, the additional work this might involve can be mitigated by placing df.persist() at relevant points.
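
A minimal sketch of that strategy, assuming df is an existing dask DataFrame (the exact 100MB target and the ceiling division are just one way to apply the guideline):

    import math

    # Total in-memory size of the data in bytes (this triggers a computation).
    total_bytes = df.memory_usage().sum().compute()

    # Aim for roughly 100MB per partition.
    target_bytes = 100e6
    npartitions = max(1, math.ceil(total_bytes / target_bytes))
    df = df.repartition(npartitions=npartitions)

    # If the dataset fits in RAM, persisting avoids redoing the work above
    # (and any earlier steps) on every subsequent computation.
    df = df.persist()

Depending on your dask version, repartition may also accept a partition_size argument that automates this calculation.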
