Remove empty partitions in Dask
Question
When loading data from CSV, some CSVs cannot be loaded, resulting in empty partitions. I would like to remove all empty partitions, as some methods do not seem to work well with them. I have tried repartitioning: for example, repartition(npartitions=10) works, but a larger value can still result in empty partitions.
What's the best way of achieving this? Thanks.
Recommended answer
I've found that filtering a Dask dataframe, e.g., by date, often results in empty partitions. If you're having trouble using a dataframe with empty partitions, here's a function, based on MRocklin's guidance, to cull them:
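import dask.dataframe as dd

def cull_empty_partitions(df):
    # Count the rows in each partition (this triggers a computation).
    lengths = list(df.map_partitions(len).compute())
    df_delayed = df.to_delayed()
    # Keep only the delayed partitions that actually contain rows.
    df_delayed_new = [part for part, n in zip(df_delayed, lengths) if n > 0]
    # Rebuild only if something was dropped and at least one partition is left;
    # if every partition is empty, the dataframe is returned unchanged.
    if 0 < len(df_delayed_new) < len(df_delayed):
        # Reuse the original (empty) metadata so column names and dtypes are kept.
        df = dd.from_delayed(df_delayed_new, meta=df._meta)
    return df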