Add a unique identifier in a new column until a condition is met on another column
Problem description
I have a dask dataframe with npartitions=8; here is a snapshot of the data:
id1 id2 Page_nbr record_type
St1 Sc1 3 START
Sc1 St1 5 ADD
Sc1 St1 9 OTHER
Sc2 St2 34 START
Sc2 St2 45 DURATION
Sc2 St2 65 END
Sc3 Sc3 4 START
I want to add a column after record_type that holds a unique group_id based on the record type: every row gets the same group_id until the next row with record_type=START. The output would look like this:
id1 id2 Page_nbr record_type group_id
St1 Sc1 3 START 1
Sc1 St1 5 ADD 1
Sc1 St1 9 OTHER 1
Sc2 St2 34 START 2
Sc2 St2 45 DURATION 2
Sc2 St2 65 END 2
Sc3 Sc3 4 START 3
The group_id can be any unique number. Because the dataframe is huge, iterating over rows may not be the best option. Is there a Pythonic way to do this?
Recommended answer
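Take the "record_type" column, compare it to "START", and then compute the cumulative sum (cumsum). The comparison yields a boolean column, and since True counts as 1, the running total goes up by one at every START row: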
ddf['group_id'] = ddf['record_type'].eq('START').cumsum()
ddf.compute()
id1 id2 Page_nbr record_type group_id
0 St1 Sc1 3 START 1
1 Sc1 St1 5 ADD 1
2 Sc1 St1 9 OTHER 1
3 Sc2 St2 34 START 2
4 Sc2 St2 45 DURATION 2
5 Sc2 St2 65 END 2
6 Sc3 Sc3 4 START 3
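
To see why this works, here is a minimal sketch that runs the same expression on a small pandas DataFrame built from the sample rows in the question. It uses pandas rather than dask purely so it can be run locally without any setup; the variable names df and is_start are illustrative, and it assumes eq and cumsum behave the same way on a dask dataframe, as the output above suggests.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "id1": ["St1", "Sc1", "Sc1", "Sc2", "Sc2", "Sc2", "Sc3"],
    "id2": ["Sc1", "St1", "St1", "St2", "St2", "St2", "Sc3"],
    "Page_nbr": [3, 5, 9, 34, 45, 65, 4],
    "record_type": ["START", "ADD", "OTHER", "START", "DURATION", "END", "START"],
})

# Boolean mask: True on every row that opens a new group.
is_start = df["record_type"].eq("START")

# True counts as 1 and False as 0, so the running total
# increases by one at each START row and stays flat in between.
df["group_id"] = is_start.cumsum()

print(df)
```

One thing to keep in mind: any rows that appear before the first START get group_id 0, because the running total has not been incremented yet.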