高效比较每月的跑步总数和每月的跑步总数 [英] Efficiently compare running total for month to total for month
问题描述
我有一个数据帧(df
).它包含来自模型的predicted
每日数据,直到2020年底.随着一年中的每一天,actual
和id
数据将添加到该行.每天有多个名字
I have a dataframe (df
). It contains predicted
daily data from a model, up until the end of 2020. As each day passes in the year, actual
and id
data is added to the row. There are multiple names for each day
+------+-----+-----------+--------+------------+
| NAME | ID | PREDICTED | ACTUAL | YYYY_MM_DD |
+------+-----+-----------+--------+------------+
| Nir | 215 | 100 | 400 | 2020-01-01 |
| Nir | 215 | 200 | 400 | 2020-01-02 |
| Nir | 215 | 100 | 400 | 2020-01-03 |
| Nir | 215 | 200 | 400 | 2020-01-04 |
| Nir | 215 | 100 | 400 | 2020-01-05 |
| Nir | 215 | 200 | 400 | 2020-01-06 |
| Nir | 215 | 100 | 400 | 2020-01-07 |
| Nir | 215 | 200 | 400 | 2020-01-08 |
| Nir | 215 | 100 | 400 | 2020-01-09 |
| Nir | 215 | 200 | 400 | 2020-01-10 |
| Nir | 215 | 100 | 400 | 2020-01-11 |
| Nir | 215 | 200 | 400 | 2020-01-12 |
| Nir | 215 | 100 | 400 | 2020-01-13 |
| Nir | 215 | 200 | 400 | 2020-01-14 |
| Nir | 215 | 100 | 400 | 2020-01-15 |
| Nir | 215 | 200 | 400 | 2020-01-16 |
| Nir | 215 | 100 | 400 | 2020-01-17 |
| Nir | 215 | 200 | 400 | 2020-01-18 |
| Nir | 215 | 100 | 400 | 2020-01-19 |
| Nir | 215 | 200 | 400 | 2020-01-20 |
| Nir | 215 | 100 | 400 | 2020-01-21 |
| Nir | 215 | 200 | 400 | 2020-01-22 |
| Nir | 215 | 100 | 400 | 2020-01-23 |
| Nir | Nan | 100 | Nan | 2020-01-24 |
| Nir | Nan | 100 | Nan | 2020-01-25 |
| Nir | Nan | 100 | Nan | 2020-01-26 |
| Nir | Nan | 100 | Nan | 2020-01-27 |
| Nir | Nan | 100 | Nan | 2020-01-28 |
| Nir | Nan | 100 | Nan | 2020-01-29 |
| Nir | Nan | 100 | Nan | 2020-01-30 |
| Nir | Nan | 100 | Nan | 2020-01-31 |
| Xyc | 40 | 800 | 500 | 2020-01-01 |
| Xyc | 40 | 100 | 500 | 2020-01-02 |
| Xyc | 40 | 100 | 500 | 2020-01-03 |
| Xyc | 40 | 100 | 500 | 2020-01-04 |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
+------+-----+-----------+--------+------------+
我想添加一个名为payout
的附加列.除非到目前为止的actual
的总和已经超过predicted
的总和,否则payout
应该为0.
I want to add an additional column named payout
. The payout
should be 0 unless the sum of actual
, month to date has passed the sum of predicted
.
即,对于Nir
,我们可以看到predicted
的总和为4200
.因此,在actual
的总和超过4200之前,payout
应该为0.一旦超过该阈值,则payout
应该为actual-predicted
的1%.使用以上数据,输出将如下所示:
I.e., for Nir
, we can see the sum of predicted
is 4200
. So the payout
should be 0 until the sum of actual
passes 4200. Once that threshold is passed, then the payout
should be 1% of actual-predicted
. With the above data, the output would look like this:
+------+-----+-----------+--------+---------------+--------+------------+
| NAME | ID | PREDICTED | ACTUAL | MONTH_TO_DATE | PAYOUT | YYYY_MM_DD |
+------+-----+-----------+--------+---------------+--------+------------+
| Nir | 215 | 100 | 400 | 400 | 0 | 2020-01-01 |
| Nir | 215 | 200 | 400 | 800 | 0 | 2020-01-02 |
| Nir | 215 | 100 | 400 | 1200 | 0 | 2020-01-03 |
| Nir | 215 | 200 | 400 | 1600 | 0 | 2020-01-04 |
| Nir | 215 | 100 | 400 | 2000 | 0 | 2020-01-05 |
| Nir | 215 | 200 | 400 | 2400 | 0 | 2020-01-06 |
| Nir | 215 | 100 | 400 | 2800 | 0 | 2020-01-07 |
| Nir | 215 | 200 | 400 | 3200 | 0 | 2020-01-08 |
| Nir | 215 | 100 | 400 | 3600 | 0 | 2020-01-09 |
| Nir | 215 | 200 | 400 | 4000 | 0 | 2020-01-10 |
| Nir | 215 | 100 | 400 | 4400 | 3 | 2020-01-11 |
| Nir | 215 | 200 | 400 | ... | 2 | 2020-01-12 |
| Nir | 215 | 100 | 400 | ... | 3 | 2020-01-13 |
| Nir | 215 | 200 | 400 | ... | 2 | 2020-01-14 |
| Nir | 215 | 100 | 400 | ... | 3 | 2020-01-15 |
| Nir | 215 | 200 | 400 | ... | 2 | 2020-01-16 |
| Nir | 215 | 100 | 400 | ... | 3 | 2020-01-17 |
| Nir | 215 | 200 | 400 | ... | 2 | 2020-01-18 |
| Nir | 215 | 100 | 400 | ... | 3 | 2020-01-19 |
| Nir | 215 | 200 | 400 | ... | 2 | 2020-01-20 |
| Nir | 215 | 100 | 400 | ... | 3 | 2020-01-21 |
| Nir | 215 | 200 | 400 | ... | 2 | 2020-01-22 |
| Nir | 215 | 100 | 400 | ... | 3 | 2020-01-23 |
| Nir | Nan | 100 | Nan | | | 2020-01-24 |
| Nir | Nan | 100 | Nan | | | 2020-01-25 |
| Nir | Nan | 100 | Nan | | | 2020-01-26 |
| Nir | Nan | 100 | Nan | | | 2020-01-27 |
| Nir | Nan | 100 | Nan | | | 2020-01-28 |
| Nir | Nan | 100 | Nan | | | 2020-01-29 |
| Nir | Nan | 100 | Nan | | | 2020-01-30 |
| Nir | Nan | 100 | Nan | | | 2020-01-31 |
| Xyc | 40 | 800 | 500 | 500 | 0 | 2020-01-01 |
| Xyc | 40 | 100 | 500 | 1000 | 0 | 2020-01-02 |
| Xyc | 40 | 100 | 500 | 1500 | 4 | 2020-01-03 |
| Xyc | 40 | 100 | 500 | 2000 | 4 | 2020-01-04 |
| ... | ... | ... | ... | | | ... |
| ... | ... | ... | ... | | | ... |
+------+-----+-----------+--------+---------------+--------+------------+
在上面的输出中,Xyc
的总预测值为2000,因此payout
应该为0,直到actual
的总和也超过2000.在真实的数据框中,每天有约70个name
的数据,所以我觉得可能需要分组.
In the above output, Xyc
has a total predicted 2000, so payout
should be 0 until the sum of actual
passes 2000 also. In the real dataframe, there is daily data for ~70 name
s, so I feel like a grouping may be needed.
我尝试过:
new_sum = [df.actual.values[0]] for i in range(1, len(df.index)):
new_sum.append(new_sum[i-1]+df.actual.values[i])
df['actual_sum'] = new_sum
但是,这只是让我连续获得了actual
的总和.我也尝试过这个:
However, that simply gave me a running total of actual
. I also tried this:
df['inc'] = df['actual'] - df['predicted'] df['payout'] = np.where(df['inc']>=1, (df['inc'] / 100) * 1, 0)
但是,以上内容无法确保归因于1%之前的月份的月至今总和>.
But the above doesn't make sure the month to date >= total for the month before attributing the 1%.
推荐答案
首先,您需要从数据中删除NaN行.
First You need to remove NaN rows from data.
您在这里:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Nir','Nir','Nir','Nir','Xyc','Xyc','Xyc'],'PREDICTED':[100,200,100,200,100,200,300],
'ACTUAL':[400,400,400,400,500,500,500],
'YYYY_MM_DD':['2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-01','2020-01-02','2020-01-03']})
def calculate(item):
# select name
data = df[df['Name'] == item]
# calculate sum
sum = data['PREDICTED'].sum()
# remove NaN rows
data = data.dropna()
# calculate and insert month to date column values
month_to_date = []
value = 0
for index, row in data.iterrows():
value += row['ACTUAL']
month_to_date.append(value)
data.insert(3, "MONTH_TO_DATE", month_to_date, True)
# calculate and instert payout values
conditions = [
(data['MONTH_TO_DATE'] < sum),
(data['MONTH_TO_DATE'] >= sum)
]
choices = [0, ((data['ACTUAL'] - data['PREDICTED'])/100).astype(int)]
data.insert(5, "PAYOUT", np.select(conditions, choices), True)
return data
# collect results
results = pd.DataFrame(columns=['Name','PREDICTED','ACTUAL','MONTH_TO_DATE','YYYY_MM_DD','PAYOUT'])
for item in df['Name'].unique():
df2 = calculate(item)
results = results.append(df2)
结果:
Name PREDICTED ACTUAL MONTH_TO_DATE YYYY_MM_DD PAYOUT
0 Nir 100 400 400 2020-01-01 0
1 Nir 200 400 800 2020-01-02 2
2 Nir 100 400 1200 2020-01-03 3
3 Nir 200 400 1600 2020-01-04 2
4 Xyc 100 500 500 2020-01-01 0
5 Xyc 200 500 1000 2020-01-02 3
6 Xyc 300 500 1500 2020-01-03 2
这篇关于高效比较每月的跑步总数和每月的跑步总数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!