Slicing a Dask Dataframe
Question
I have the following code, and I would like to do a train/test split on a Dask dataframe:
import dask.dataframe as dd

df = dd.read_csv(csv_filename, sep=',', encoding="latin-1",
                 names=cols, header=0, dtype='str')
But when I try to take slices like this:
for train, test in cv.split(X, y):
    df.fit(X[train], y[train])
it fails with the error:
KeyError: '[11639 11641 11642 ..., 34997 34998 34999] not in index'
Any ideas?
Answer
Dask.dataframe doesn't support row-wise slicing. It does support the loc operation if you have a sensible index.
However, in your case of train/test splitting, you will probably be better served by the random_split method.
train, test = df.random_split([0.80, 0.20])
You could also make many splits and concatenate them in different ways, for example to build cross-validation folds:
splits = df.random_split([0.20, 0.20, 0.20, 0.20, 0.20])
for i in range(5):
    trains = [splits[j] for j in range(5) if j != i]
    train = dd.concat(trains, axis=0)
    test = splits[i]