What is the way to add an index column in Dask when reading from a CSV?
Question
I'm trying to process a fairly large dataset that doesn't fit into memory when loaded all at once with Pandas, so I'm using Dask. However, I'm having difficulty adding a unique ID column to the dataset after reading it with the read_csv method. I keep getting an error (see the code below). I'm trying to create an index column so I can set that new column as the index for the data, but the error appears to be telling me to set the index first, before creating the column.
import numpy as np
import dask.dataframe as dd

df = dd.read_csv(r'path\to\file\file.csv')  # File does not have a unique ID column, so I have to create one.
df['index_col'] = dd.from_array(np.arange(len(df)))  # Trying to add an index column and fill it
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
Update
Using range(1, len(df) + 1) instead
changed the error to: TypeError: Column assignment doesn't support type range
Answer
Right, it's hard to know the number of lines in each chunk of a CSV file without reading through it, so it's hard to produce an index like 0, 1, 2, 3, ... if the dataset spans multiple partitions.
One approach would be to create a column of ones:
df["idx"] = 1
and then call cumsum:
df["idx"] = df["idx"].cumsum()
But note that this does add a bunch of dependencies to the task graph that backs your dataframe, so some operations might not be as parallel as they were before.