Pandas: conditional shift


Question

Is there a way to shift a dataframe column depending on a condition over two other columns? Something like:

df["cumulated_closed_value"] = df.groupby("user").['close_cumsum'].shiftWhile(df['close_time']>df['open_time])

I have figured out a way to do this, but it's inefficient:

1) Load the data and create the column to shift

import pandas as pd

# Parse the dates up front so the later comparisons work on datetimes.
df = pd.read_csv('data.csv', parse_dates=['open_time', 'close_time'])
df.sort_values(['user', 'close_time'], inplace=True)
df['close_cumsum'] = df.groupby('user')['value'].cumsum()  # running total in close order
df.sort_values(['user', 'open_time'], inplace=True)
print(df)

Output:

   user  open_time close_time  value  close_cumsum
0     1 2017-01-01 2017-03-01      5            18
1     1 2017-01-02 2017-02-01      6             6
2     1 2017-02-03 2017-02-05      7            13
3     1 2017-02-07 2017-04-01      3            21
4     1 2017-09-07 2017-09-11      1            22
5     2 2018-01-01 2018-02-01     15            15
6     2 2018-03-01 2018-04-01      3            18
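
Since data.csv isn't attached to the question, the same frame can be reconstructed inline for experimenting (a sketch built from the printed output above):

import pandas as pd

df = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 2, 2],
    "open_time": pd.to_datetime([
        "2017-01-01", "2017-01-02", "2017-02-03", "2017-02-07",
        "2017-09-07", "2018-01-01", "2018-03-01",
    ]),
    "close_time": pd.to_datetime([
        "2017-03-01", "2017-02-01", "2017-02-05", "2017-04-01",
        "2017-09-11", "2018-02-01", "2018-04-01",
    ]),
    "value": [5, 6, 7, 3, 1, 15, 3],
})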

2) Shift the column using a self-join and some filters

Self-join (this is memory-inefficient):

df2 = pd.merge(df[['user','open_time']], df[['user','close_time','close_cumsum']], on='user')

"close_time"的过滤器<'open_time'.然后获取具有最大close_time的行

filter for 'close_time' < 'open_time'. Then get the row with the max close_time

df2 = df2[df2['close_time'] < df2['open_time']]
idx = df2.groupby(['user', 'open_time'])['close_time'].transform(max) == df2['close_time']
df2 = df2[idx]
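
A side note on this step: the transform/mask idiom can be written more directly with idxmax, which picks one row per group (a sketch; unlike the mask above, it keeps a single row when several share the max close_time):

df2 = df2.loc[df2.groupby(['user', 'open_time'])['close_time'].idxmax()]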

3) Merge with the original dataset:

df3 = pd.merge(df[['user','open_time','close_time','value']], df2[['user','open_time','close_cumsum']], how='left')
print(df3)

Output:

   user  open_time close_time  value  close_cumsum
0     1 2017-01-01 2017-03-01      5           NaN
1     1 2017-01-02 2017-02-01      6           NaN
2     1 2017-02-03 2017-02-05      7           6.0
3     1 2017-02-07 2017-04-01      3          13.0
4     1 2017-09-07 2017-09-11      1          21.0
5     2 2018-01-01 2018-02-01     15           NaN
6     2 2018-03-01 2018-04-01      3          15.0

Is there a more pandas-idiomatic way to get the same result?

Edit: I have added one data row to make the case clearer. My goal is to get, for each transaction, the sum of all transactions closed before its opening time.
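
That goal has a direct brute-force encoding, handy as a reference implementation when testing faster versions (a sketch of my own, O(n²) per user):

# For each row, sum the values of same-user transactions
# that closed strictly before this row opened.
df["cumulated_closed_value"] = [
    df.loc[(df["user"] == row.user) & (df["close_time"] < row.open_time), "value"].sum()
    for row in df.itertuples()
]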

Answer

I made a modification to your test case that I think you should include. This solution does handle your edit.

import pandas as pd
import numpy as np
df = pd.read_csv("cond_shift.csv")
df

Input:

   user open_time   close_time  value
0   1   12/30/2016  12/31/2016  1
1   1   1/1/2017    3/1/2017    5
2   1   1/2/2017    2/1/2017    6
3   1   2/3/2017    2/5/2017    7
4   1   2/7/2017    4/1/2017    3
5   1   9/7/2017    9/11/2017   1
6   2   1/1/2018    2/1/2018    15
7   2   3/1/2018    4/1/2018    3

Create the column to shift:

df["open_time"] = pd.to_datetime(df["open_time"])
df["close_time"] = pd.to_datetime(df["close_time"])
df.sort_values(['user','close_time'],inplace=True)
df['close_cumsum']=df.groupby('user')['value'].cumsum()
df.sort_values(['user','open_time'],inplace=True)
df


   user open_time   close_time  value   close_cumsum
0   1   2016-12-30  2016-12-31  1       1
1   1   2017-01-01  2017-03-01  5       19
2   1   2017-01-02  2017-02-01  6       7
3   1   2017-02-03  2017-02-05  7       14
4   1   2017-02-07  2017-04-01  3       22
5   1   2017-09-07  2017-09-11  1       23
6   2   2018-01-01  2018-02-01  15      15
7   2   2018-03-01  2018-04-01  3       18

Shift the columns (explanation below):

df["cumulated_closed_value"] = df.groupby("user")["close_cumsum"].transform("shift")
condition = ~(df.groupby("user")['close_time'].transform("shift") < df["open_time"])
df.loc[ condition,"cumulated_closed_value" ] = None
df["cumulated_closed_value"] =df.groupby("user")["cumulated_closed_value"].fillna(method="ffill").fillna(0)
df


user    open_time   close_time  value   close_cumsum    cumulated_closed_value
0   1   2016-12-30  2016-12-31  1       1               0.0
1   1   2017-01-01  2017-03-01  5       19              1.0
2   1   2017-01-02  2017-02-01  6       7               1.0
3   1   2017-02-03  2017-02-05  7       14              7.0
4   1   2017-02-07  2017-04-01  3       22              14.0
5   1   2017-09-07  2017-09-11  1       23              22.0
6   2   2018-01-01  2018-02-01  15      15              0.0
7   2   2018-03-01  2018-04-01  3       18              15.0

All of this is written so that it runs across all users at once. I believe the logic is easier to follow one user at a time (a single-user sketch follows the list):

  • Shift the cumulative sum down one row. This would already be the answer if no transaction were still open when the next one started.
  • Remove (blank out) the shifted values on rows that overlap an earlier, still-open transaction.
  • Fill in the missing values with a forward fill (rows with no prior close get 0).
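
To make these steps concrete, here is a minimal single-user sketch (toy data of my own, not the answer's csv) that materialises each step:

import pandas as pd

# Toy data for one user: the first transaction is still open
# when the next two begin.
df = pd.DataFrame({
    "user": [1, 1, 1],
    "open_time": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-02-03"]),
    "close_time": pd.to_datetime(["2017-03-01", "2017-02-01", "2017-02-05"]),
    "value": [5, 6, 7],
})

# Cumulative sum of closed values, computed in close_time order.
df = df.sort_values(["user", "close_time"])
df["close_cumsum"] = df.groupby("user")["value"].cumsum()
df = df.sort_values(["user", "open_time"])

g = df.groupby("user")
shifted = g["close_cumsum"].shift()                     # step 1
overlap = ~(g["close_time"].shift() < df["open_time"])  # step 2 mask (NaT compares False)
df["cumulated_closed_value"] = (
    shifted.mask(overlap).groupby(df["user"]).ffill().fillna(0)  # step 3
)
print(df)  # cumulated_closed_value: 0.0, 0.0, 6.0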

I would still test this thoroughly before using it. Time intervals are weird, and there are a lot of edge cases.
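
As an aside on "is there a more pandas-idiomatic way": pd.merge_asof performs exactly this "latest close strictly before each open" lookup and avoids the full self-join. This is a sketch of my own, not part of the original answer; it assumes df already carries close_cumsum as built above:

right = df[["user", "close_time", "close_cumsum"]].sort_values("close_time")

out = pd.merge_asof(
    df.sort_values("open_time"), right,
    left_on="open_time", right_on="close_time",
    by="user", allow_exact_matches=False,  # strict close_time < open_time
    suffixes=("", "_prev"),
)
out["cumulated_closed_value"] = out["close_cumsum_prev"].fillna(0)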
