What is the way to add an index column in Dask when reading from a CSV?


Problem description


I'm trying to process a fairly large dataset that doesn't fit into memory when loaded at once with Pandas, so I'm using Dask. However, I'm having difficulty adding a unique ID column to the dataset after reading it with the read_csv method. I keep getting an error (see the code below). I'm trying to create an index column so I can set that new column as the index for the data, but the error appears to be telling me to set the index first, before creating the column.

import numpy as np
import dask.dataframe as dd

df = dd.read_csv(r'path\to\file\file.csv')  # File does not have a unique ID column, so I have to create one.
df['index_col'] = dd.from_array(np.arange(len(df)))  # Trying to add an index column and fill it
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.


Update

Using range(1, len(df) + 1) changed the error to: TypeError: Column assignment doesn't support type range
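For contrast, a minimal plain-pandas sketch (the sample values and the index_col name here are made up): in pandas this kind of assignment works because len(df) is known up front, which is exactly the information Dask lacks before it has read through the file.

```python
import numpy as np
import pandas as pd

# In plain pandas, len(df) is known immediately, so a 1-based ID column
# can be assigned by materializing the range as a NumPy array.
df = pd.DataFrame({"value": [10, 20, 30]})
df["index_col"] = np.arange(1, len(df) + 1)
print(df["index_col"].tolist())  # [1, 2, 3]
```

In Dask the total length and the per-partition lengths are unknown until the CSV is actually read, which is why both the array assignment and the range assignment fail.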

Recommended answer


Right, it's hard to know the number of lines in each chunk of a CSV file without reading through it, so it's hard to produce an index like 0, 1, 2, 3, ... when the dataset spans multiple partitions.


One approach would be to create a column of ones:

df["idx"] = 1

and then call cumsum:

df["idx"] = df["idx"].cumsum()


But note that this does add a bunch of dependencies to the task graph that backs your dataframe, so some operations might not be as parallel as they were before.
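The trick above can be illustrated with plain pandas frames standing in for two Dask partitions (the idx column name and the sample values are made up for the example):

```python
import pandas as pd

# Two small pandas frames stand in for two Dask partitions.
part1 = pd.DataFrame({"value": [10, 20, 30]})
part2 = pd.DataFrame({"value": [40, 50]})

# In Dask the same two assignments run lazily on each partition, with
# cumsum carrying partial sums across partition boundaries.
df = pd.concat([part1, part2], ignore_index=True)
df["idx"] = 1
df["idx"] = df["idx"].cumsum()
print(df["idx"].tolist())  # [1, 2, 3, 4, 5]
```

Note that this produces a 1-based ID; subtract 1 from the cumsum result if a 0-based index is wanted.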
