Remove empty partitions in Dask

Question

When loading data from CSV files, some of the files cannot be loaded, which leaves empty partitions. I would like to remove all empty partitions, because some methods do not seem to work well with them. I have tried repartitioning: for example, repartition(npartitions=10) works, but a larger value can still produce empty partitions.

What is the best way to achieve this? Thanks.

Recommended answer

I've found that filtering a Dask dataframe, e.g., by date, often results in empty partitions. If you're having trouble using a dataframe with empty partitions, here's a function, based on MRocklin's guidance, to cull them. It computes the length of every partition, so it triggers a computation over the whole dataframe:

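import dask.dataframe as dd

def cull_empty_partitions(df):
    # Row count of every partition (this triggers a computation).
    lengths = list(df.map_partitions(len).compute())
    df_delayed = df.to_delayed()
    df_delayed_new = []
    pempty = None
    for ix, n in enumerate(lengths):
        if n == 0:
            # Remember an empty partition to use as the meta template.
            pempty = df.get_partition(ix)
        else:
            df_delayed_new.append(df_delayed[ix])
    # Rebuild the dataframe only if an empty partition was found.
    if pempty is not None:
        df = dd.from_delayed(df_delayed_new, meta=pempty)
    return df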
