使用 Groupby 识别 Pandas Dataframe 中连续的相同值 [英] Identify consecutive same values in Pandas Dataframe, with a Groupby
问题描述
我有以下数据帧 df:
I have the following dataframe df:
data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]}
df=pd.DataFrame.from_dict(data)
df
Out[8]:
id value
0 1 2
1 1 2
2 1 3
3 1 2
4 1 2
5 1 2
6 1 3
7 1 3
8 1 3
9 1 3
10 2 1
11 2 4
12 2 1
13 2 1
14 2 1
15 2 4
16 2 4
17 2 1
18 2 1
19 2 1
20 2 1
21 2 1
我需要做的是在 id 级别(df.groupby['id'])识别当值连续显示相同数字 3 次或更多次时.
What I need to do is identify at the id level (df.groupby['id']) when the value shows the same number consecutively for 3 or more times.
我希望得到以下结果:
df
Out[12]:
id value flag
0 1 2 0
1 1 2 0
2 1 3 0
3 1 2 1
4 1 2 1
5 1 2 1
6 1 3 1
7 1 3 1
8 1 3 1
9 1 3 1
10 2 1 0
11 2 4 0
12 2 1 1
13 2 1 1
14 2 1 1
15 2 4 0
16 2 4 0
17 2 1 1
18 2 1 1
19 2 1 1
20 2 1 1
21 2 1 1
我已经尝试使用 pandas rolling.mean 来确定 groupby 和 lambda 的变体,然后将滚动周期的平均值与值"进行比较,并且在它们相同的地方表示一个标志.但这有几个问题,包括您可能有不同的值,这些值将平均为您尝试标记的值.此外,我无法弄清楚如何标记"创建初始标记的滚动均值的所有值.看到这里,这标识了标志的右侧",但是我需要填充滚动平均长度的先前值.在此处查看我的代码:
I have tried variations of groupby and lambda using pandas rolling.mean to identify where the average of the rolling period is then compared to the 'value', and where they are the same this indicates a flag. But this has several problems, including that you could have different values that will average to the value you are trying to flag. Also, I can't figure out how to 'flag' all of the values of the rolling mean that created the initial flag. See here, this identifies the 'right side' of the flag, but then I need to fill the previous values of the rolling mean length. See my code here:
test=df.copy()
test['rma']=test.groupby('id')['value'].transform(lambda x: x.rolling(min_periods=3,window=3).mean())
test['flag']=np.where(test.rma==test.value,1,0)
结果如下:
test
Out[61]:
id value rma flag
0 1 2 NaN 0
1 1 2 NaN 0
2 1 3 2.333333 0
3 1 2 2.333333 0
4 1 2 2.333333 0
5 1 2 2.000000 1
6 1 3 2.333333 0
7 1 3 2.666667 0
8 1 3 3.000000 1
9 1 3 3.000000 1
10 2 1 NaN 0
11 2 4 NaN 0
12 2 1 2.000000 0
13 2 1 2.000000 0
14 2 1 1.000000 1
15 2 4 2.000000 0
16 2 4 3.000000 0
17 2 1 3.000000 0
18 2 1 2.000000 0
19 2 1 1.000000 1
20 2 1 1.000000 1
21 2 1 1.000000 1
迫不及待想看看我错过了什么!谢谢
Can't wait to see what I am missing! Thanks
推荐答案
你可以试试这个;1) 使用 df.value.diff().ne(0).cumsum()
创建一个额外的组变量来表示值的变化;2)使用transform('size')
计算组大小并与三个比较,然后得到你需要的flag
列:
You can try this; 1) Create an extra group variable with df.value.diff().ne(0).cumsum()
to denote the value changes; 2) use transform('size')
to calculate the group size and compare with three, then you get the flag
column you need:
df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int)
df
故障:
1) diff
不等于零(这就是 df.value.diff().ne(0)
的意思) 给出一个条件 True
每当值发生变化时:
1) diff
is not equal to zero (which is literally what df.value.diff().ne(0)
means) gives a condition True
whenever there is a value change:
df.value.diff().ne(0)
#0 True
#1 False
#2 True
#3 True
#4 False
#5 False
#6 True
#7 False
#8 False
#9 False
#10 True
#11 True
#12 True
#13 False
#14 False
#15 True
#16 False
#17 True
#18 False
#19 False
#20 False
#21 False
#Name: value, dtype: bool
2) 然后 cumsum
给出一个非降序的 id 序列,其中每个 id 表示一个具有相同值的连续块,注意在对布尔值求和时,True
被认为是一个而 False
被视为零:
2) Then cumsum
gives a non descending sequence of ids where each id denotes a consecutive chunk with same values, note when summing boolean values, True
is considered as one while False
is considered as zero:
df.value.diff().ne(0).cumsum()
#0 1
#1 1
#2 2
#3 3
#4 3
#5 3
#6 4
#7 4
#8 4
#9 4
#10 5
#11 6
#12 7
#13 7
#14 7
#15 8
#16 8
#17 9
#18 9
#19 9
#20 9
#21 9
#Name: value, dtype: int64
3) 结合id
列,可以对数据框进行分组,计算分组大小,得到flag
列.
3) combined with id
column, you can group the data frame, calculate the group size and get the flag
column.
这篇关于使用 Groupby 识别 Pandas Dataframe 中连续的相同值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!