使用Groupby识别Pandas Dataframe中的连续相同值 [英] Identify consecutive same values in Pandas Dataframe, with a Groupby

查看:1469
本文介绍了使用Groupby识别Pandas Dataframe中的连续相同值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下数据框df:

data={'id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2],
      'value':[2,2,3,2,2,2,3,3,3,3,1,4,1,1,1,4,4,1,1,1,1,1]}
df=pd.DataFrame.from_dict(data)
df
Out[8]: 
    id  value
0    1      2
1    1      2
2    1      3
3    1      2
4    1      2
5    1      2
6    1      3
7    1      3
8    1      3
9    1      3
10   2      1
11   2      4
12   2      1
13   2      1
14   2      1
15   2      4
16   2      4
17   2      1
18   2      1
19   2      1
20   2      1
21   2      1

我需要做的是在ID级别(df.groupby ['id'])进行识别,当该值连续3次或多次连续显示相同的数字时.

What I need to do is identify at the id level (df.groupby['id']) when the value shows the same number consecutively for 3 or more times.

我希望得到以下结果:

df
Out[12]: 
    id  value  flag
0    1      2     0
1    1      2     0
2    1      3     0
3    1      2     1
4    1      2     1
5    1      2     1
6    1      3     1
7    1      3     1
8    1      3     1
9    1      3     1
10   2      1     0
11   2      4     0
12   2      1     1
13   2      1     1
14   2      1     1
15   2      4     0
16   2      4     0
17   2      1     1
18   2      1     1
19   2      1     1
20   2      1     1
21   2      1     1

我尝试使用pandas rolling.mean来尝试groupby和lambda的变体,以确定随后将平均滚动周期与值"进行比较的地方,以及在相同的地方这表示一个标志.但这有几个问题,包括您可能拥有不同的值,这些值的平均值将等于您要标记的值.另外,我不知道如何标记"创建初始标记的滚动平均值的所有值.看到这里,它标识了标志的右侧",但是随后我需要填充滚动平均长度的先前值.在这里查看我的代码:

I have tried variations of groupby and lambda using pandas rolling.mean to identify where the average of the rolling period is then compared to the 'value', and where they are the same this indicates a flag. But this has several problems, including that you could have different values that will average to the value you are trying to flag. Also, I can't figure out how to 'flag' all of the values of the rolling mean that created the initial flag. See here, this identifies the 'right side' of the flag, but then I need to fill the previous values of the rolling mean length. See my code here:

test=df.copy()
test['rma']=test.groupby('id')['value'].transform(lambda x: x.rolling(min_periods=3,window=3).mean())
test['flag']=np.where(test.rma==test.value,1,0)

结果在这里:

test
Out[61]: 
    id  value       rma  flag
0    1      2       NaN     0
1    1      2       NaN     0
2    1      3  2.333333     0
3    1      2  2.333333     0
4    1      2  2.333333     0
5    1      2  2.000000     1
6    1      3  2.333333     0
7    1      3  2.666667     0
8    1      3  3.000000     1
9    1      3  3.000000     1
10   2      1       NaN     0
11   2      4       NaN     0
12   2      1  2.000000     0
13   2      1  2.000000     0
14   2      1  1.000000     1
15   2      4  2.000000     0
16   2      4  3.000000     0
17   2      1  3.000000     0
18   2      1  2.000000     0
19   2      1  1.000000     1
20   2      1  1.000000     1
21   2      1  1.000000     1

等不及要看我想念的东西!谢谢

Can't wait to see what I am missing! Thanks

推荐答案

您可以尝试一下; 1)使用df.value.diff().ne(0).cumsum()创建一个额外的组变量以表示值更改; 2)使用transform('size')计算组大小并将其与三个进行比较,然后您将获得所需的flag列:

You can try this; 1) Create an extra group variable with df.value.diff().ne(0).cumsum() to denote the value changes; 2) use transform('size') to calculate the group size and compare with three, then you get the flag column you need:

df['flag'] = df.value.groupby([df.id, df.value.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int) 
df

突发事件:

1) diff不等于零(这实际上是df.value.diff().ne(0)的意思)在值发生变化时给出条件True:

1) diff is not equal to zero (which is literally what df.value.diff().ne(0) means) gives a condition True whenever there is a value change:

df.value.diff().ne(0)
#0      True
#1     False
#2      True
#3      True
#4     False
#5     False
#6      True
#7     False
#8     False
#9     False
#10     True
#11     True
#12     True
#13    False
#14    False
#15     True
#16    False
#17     True
#18    False
#19    False
#20    False
#21    False
#Name: value, dtype: bool

2)然后cumsum给出id的非降序序列,其中每个id表示具有相同值的连续块,请注意在对布尔值求和时,True被视为1,而False被视为零:

2) Then cumsum gives a non descending sequence of ids where each id denotes a consecutive chunk with same values, note when summing boolean values, True is considered as one while False is considered as zero:

df.value.diff().ne(0).cumsum()
#0     1
#1     1
#2     2
#3     3
#4     3
#5     3
#6     4
#7     4
#8     4
#9     4
#10    5
#11    6
#12    7
#13    7
#14    7
#15    8
#16    8
#17    9
#18    9
#19    9
#20    9
#21    9
#Name: value, dtype: int64

3)与id列结合使用,您可以对数据框进行分组,计算组大小并获得flag列.

3) combined with id column, you can group the data frame, calculate the group size and get the flag column.

这篇关于使用Groupby识别Pandas Dataframe中的连续相同值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆