In Python Pandas using cumsum with groupby and reset of cumsum when value is 0


Problem description


I'm rather new to Python. I am trying to compute a cumulative sum for each client to count the consecutive months of inactivity (flag: 1 or 0). The cumulative sum of the 1s therefore needs to be reset whenever there is a 0. The reset also needs to happen when a new client starts. See the example below, where a is the client column and b holds the dates.

After some research, I found the questions 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume that I need to combine them. Adapting the code of 'Cumsum reset at NaN' to reset at 0 works:

# v is a Series; ffill() replaces the deprecated fillna(method='pad')
cumsum = v.cumsum().ffill()
reset = -cumsum[v.isnull()].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()

However, I have not managed to add a groupby: my count just carries on from one client to the next.
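The question shows only the NaN-based snippet, not the adapted one. As a minimal sketch, assuming a plain 0/1 Series named v (the sample values are made up for illustration), the zero-reset adaptation might look like this for a single series, without any grouping:

```python
import pandas as pd

# Hypothetical sample flags for one client (not taken from the question).
v = pd.Series([1, 0, 1, 1, 0, 1, 1, 1])

# Running total that ignores the resets.
cumsum = v.cumsum()

# At each 0, compute the negative of the amount accumulated since the
# previous 0 (or since the start, for the first 0).
reset = -cumsum[v == 0].diff().fillna(cumsum)

# Replace each 0 with that negative amount, so the final cumsum
# drops back to 0 at those positions.
result = v.where(v != 0, reset).cumsum()

print(result.tolist())
```

Because nothing here knows about clients, a run of 1s that spans the last rows of one client and the first rows of the next keeps counting, which is the behaviour described above.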

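So, a dataset would look like this:

import pandas as pd

# b holds month/year labels; 7 months per client is assumed here so that all columns have 14 rows
df = pd.DataFrame({'a': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'b': ['1/15', '2/15', '3/15', '4/15', '5/15', '6/15', '7/15'] * 2,
                   'c': [1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]})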

This should result in a DataFrame with the columns a, b, c and d, where

'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]

Please note that I have a very large dataset, so calculation time is really important.

Thank you for helping me

Solution

Use groupby.apply and, inside each group, use cumsum to label the runs of contiguous equal values. Then groupby.cumcount on those labels gives the zero-based position of each row within its run; add 1 to make it one-based.

Multiply by the original values to get AND logic: the zeros cancel out, and only the rows with a 1 keep their count.

# group_keys=False keeps the original row index, so the result aligns with df
df['d'] = (df.groupby('a', group_keys=False)['c']
             .apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1)))

print(df['d'])

0     1
1     0
2     1
3     0
4     1
5     2
6     0
7     1
8     2
9     0
10    1
11    2
12    3
13    4
Name: d, dtype: int64
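To see why the one-liner works, the expression can be split into its intermediate steps. The sketch below does this for the c values of client 1 only; the names run_start, run_id and position are introduced here for illustration and do not appear in the answer.

```python
import pandas as pd

# Column c for client 1 from the sample data.
x = pd.Series([1, 0, 1, 0, 1, 1, 0])

# True wherever the value differs from the previous row (a new run starts).
run_start = x != x.shift()

# Cumulative sum of the run starts gives every run its own label.
run_id = run_start.cumsum()

# One-based position of each row inside its run.
position = x.groupby(run_id).cumcount() + 1

# Multiplying by x zeroes out the rows where the flag is 0.
d = x * position

print(pd.DataFrame({'c': x, 'run_id': run_id, 'position': position, 'd': d}))
```

Rows 4 and 5 share run_id 5, so their positions are 1 and 2; every other row is a run of length one.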


Another way would be to apply a function after series.expanding on the groupby object. For every row, the expanding window holds the values of the group from its first index up to the current index, and the function is evaluated on that window.

Inside the function, use reduce to apply a two-argument function cumulatively to the items of the window, reducing it to a single value: the running count is incremented when the item is 1 and reset to 0 otherwise.

from functools import reduce

(df.groupby('a')['c'].expanding()
   .apply(lambda i: reduce(lambda x, y: x + 1 if y == 1 else 0, i, 0)))

a    
1  0     1.0
   1     0.0
   2     1.0
   3     0.0
   4     1.0
   5     2.0
   6     0.0
2  7     1.0
   8     2.0
   9     0.0
   10    1.0
   11    2.0
   12    3.0
   13    4.0
Name: c, dtype: float64
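The reduce step can be tried in plain Python, without pandas. The sketch below uses the c values of client 1; the helper name step is made up for this example. It first mimics the expanding window by reducing every prefix, then shows that itertools.accumulate (with the initial argument, available from Python 3.8) yields the same counts in a single pass. The accumulate variant is an alternative offered here, not part of the answer.

```python
from functools import reduce
from itertools import accumulate


def step(count, flag):
    """Increment the running count on a 1, reset it to 0 otherwise."""
    return count + 1 if flag == 1 else 0


# Column c for client 1 from the sample data.
values = [1, 0, 1, 0, 1, 1, 0]

# What expanding().apply(...) does: reduce each prefix from scratch.
prefix_results = [reduce(step, values[:n], 0) for n in range(1, len(values) + 1)]

# Same counts in one pass; drop the leading initial value 0.
single_pass = list(accumulate(values, step, initial=0))[1:]

print(prefix_results)
print(single_pass)
```

Reducing every prefix repeats work that grows with the group length, which is worth keeping in mind for long groups.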

Timings:

%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop

%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
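Both timed variants call a Python lambda once per group. Since the question stresses a very large dataset, a fully vectorized variant may be worth trying. The following is a minimal sketch, not part of the original answer, and it assumes the rows are already sorted so that each client's rows are contiguous and in date order. The names new_block and block_id are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
                   'c': [1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]})

# A new counting block starts at every 0 and at the first row of each client.
new_block = df['c'].eq(0) | df['a'].ne(df['a'].shift())

# Label the blocks, then take the cumulative sum of c inside each block.
block_id = new_block.cumsum()
df['d'] = df['c'].groupby(block_id).cumsum()

print(df['d'].tolist())
```

Each block begins either with a 0 or with a client's first row, so the cumulative sum inside it counts the 1s since the last reset. No timing is claimed here; it would need to be measured on the real data.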
