Shuffling data in dask


Problem description


This is a follow-on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to an ML algorithm.

The answer to that question was:

# Break the dataframe into many small pieces and materialize each one in turn
for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()


However, even if I were to shuffle the contents of batch, I'm a bit worried that it might not be ideal. The data is a time series, so datapoints would be highly correlated within each partition.
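For reference, shuffling within each batch is straightforward once it has been computed into a pandas dataframe; a minimal sketch using pandas' sample (the random_state is an arbitrary choice for reproducibility):

for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()
    # frac=1 draws every row exactly once, i.e. a random permutation
    batch = batch.sample(frac=1, random_state=42)

This only mixes rows within a batch, though; it does nothing about the correlation between batches.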

Ideally I would like something like:

# Sample batch_size row positions without replacement, then take those rows
rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]


which would work in pandas but not dask, since dask dataframes don't support positional row indexing with .iloc. Any thoughts?

What I've tried

import numpy as np

len_df = len(df)  # total number of rows (this triggers a compute in dask)
train_len = int(len_df * 0.8)
idx = np.random.permutation(len_df)  # random ordering of all row labels
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]


However, if I try doing train_df.loc[:5, :].compute() this returns a 124451-row dataframe, so clearly I'm using dask wrong.
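Part of what's happening is that .loc slices by index label rather than by position, so after a random permutation .loc[:5] is not "the first five rows". For a plain random train/test split, dask dataframes also provide random_split, which avoids building the index permutation by hand (a sketch; the fractions and random_state here are arbitrary):

# Randomly assign rows to ~80% train / ~20% test
train_df, test_df = df.random_split([0.8, 0.2], random_state=42)

Note that random_split assigns rows to the two outputs at random but does not reorder rows within partitions.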

Recommended answer


I recommend adding a column of random data to your dataframe and then using that to set the index:

# Tag every row with a random value, then shuffle globally by indexing on it
df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')
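The answer leaves add_random_column_to_pandas_dataframe unspecified. A minimal sketch of what such a helper could look like (the function body and the _rand column name are illustrative assumptions, not part of the original answer):

import numpy as np

def add_random_column_to_pandas_dataframe(part):
    # part is one pandas partition; give every row a random sort key
    return part.assign(_rand=np.random.random(len(part)))

df = df.map_partitions(add_random_column_to_pandas_dataframe)
df = df.set_index('_rand')

Setting the index forces a full shuffle of the data across partitions, so rows that were adjacent in the original time series end up scattered, and batches drawn from each partition behave much more like random samples.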

