Shuffling data in dask


Problem description


This is a follow-on question from Subsetting Dask DataFrames. I wish to shuffle data from a dask dataframe before sending it in batches to an ML algorithm.

The answer to that question was:

# Break the dataframe into many small pieces and materialize each one in turn
for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()


However, even if I were to shuffle the contents of batch, I'm a bit worried that it might not be ideal. The data is a time series, so datapoints would be highly correlated within each partition.
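For reference, shuffling within each batch is straightforward once it has been computed into a pandas dataframe; a minimal sketch using pandas' sample (the random_state is an arbitrary choice for reproducibility):

for part in df.repartition(npartitions=100).to_delayed():
    batch = part.compute()
    # frac=1 draws every row exactly once, i.e. a random permutation
    batch = batch.sample(frac=1, random_state=42)

This only mixes rows within a batch, though; it does nothing about the correlation between batches.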

Ideally I would like something like:

# Sample batch_size row positions without replacement, then take those rows
rand_idx = np.random.choice(len(df), batch_size, replace=False)
batch = df.iloc[rand_idx, :]


which would work in pandas but not dask, since dask dataframes don't support positional row indexing with .iloc. Any thoughts?

What I've tried

import numpy as np

len_df = len(df)  # total number of rows (this triggers a compute in dask)
train_len = int(len_df * 0.8)
idx = np.random.permutation(len_df)  # random ordering of all row labels
train_idx = idx[:train_len]
test_idx = idx[train_len:]
train_df = df.loc[train_idx]
test_df = df.loc[test_idx]


However, if I try doing train_df.loc[:5, :].compute() this returns a 124451-row dataframe, so clearly I'm using dask wrong.
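Part of what's happening is that .loc slices by index label rather than by position, so after a random permutation .loc[:5] is not "the first five rows". For a plain random train/test split, dask dataframes also provide random_split, which avoids building the index permutation by hand (a sketch; the fractions and random_state here are arbitrary):

# Randomly assign rows to ~80% train / ~20% test
train_df, test_df = df.random_split([0.8, 0.2], random_state=42)

Note that random_split assigns rows to the two outputs at random but does not reorder rows within partitions.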

Recommended answer


I recommend adding a column of random data to your dataframe and then using that to set the index:

# Tag every row with a random value, then shuffle globally by indexing on it
df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
df = df.set_index('name-of-random-column')
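The answer leaves add_random_column_to_pandas_dataframe unspecified. A minimal sketch of what such a helper could look like (the function body and the _rand column name are illustrative assumptions, not part of the original answer):

import numpy as np

def add_random_column_to_pandas_dataframe(part):
    # part is one pandas partition; give every row a random sort key
    return part.assign(_rand=np.random.random(len(part)))

df = df.map_partitions(add_random_column_to_pandas_dataframe)
df = df.set_index('_rand')

Setting the index forces a full shuffle of the data across partitions, so rows that were adjacent in the original time series end up scattered, and batches drawn from each partition behave much more like random samples.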

