Efficiently compare running total for month to total for month

Views: 90 | Posted: 2020/5/18 22:48:28 | Tags: python, python-3.x, pandas, numpy


Question

I have a dataframe (df). It contains predicted daily data from a model, up until the end of 2020. As each day of the year passes, actual and id data are added to that day's row. There are multiple names for each day.

+------+-----+-----------+--------+------------+
| NAME | ID  | PREDICTED | ACTUAL | YYYY_MM_DD |
+------+-----+-----------+--------+------------+
| Nir  | 215 | 100       | 400    | 2020-01-01 |
| Nir  | 215 | 200       | 400    | 2020-01-02 |
| Nir  | 215 | 100       | 400    | 2020-01-03 |
| Nir  | 215 | 200       | 400    | 2020-01-04 |
| Nir  | 215 | 100       | 400    | 2020-01-05 |
| Nir  | 215 | 200       | 400    | 2020-01-06 |
| Nir  | 215 | 100       | 400    | 2020-01-07 |
| Nir  | 215 | 200       | 400    | 2020-01-08 |
| Nir  | 215 | 100       | 400    | 2020-01-09 |
| Nir  | 215 | 200       | 400    | 2020-01-10 |
| Nir  | 215 | 100       | 400    | 2020-01-11 |
| Nir  | 215 | 200       | 400    | 2020-01-12 |
| Nir  | 215 | 100       | 400    | 2020-01-13 |
| Nir  | 215 | 200       | 400    | 2020-01-14 |
| Nir  | 215 | 100       | 400    | 2020-01-15 |
| Nir  | 215 | 200       | 400    | 2020-01-16 |
| Nir  | 215 | 100       | 400    | 2020-01-17 |
| Nir  | 215 | 200       | 400    | 2020-01-18 |
| Nir  | 215 | 100       | 400    | 2020-01-19 |
| Nir  | 215 | 200       | 400    | 2020-01-20 |
| Nir  | 215 | 100       | 400    | 2020-01-21 |
| Nir  | 215 | 200       | 400    | 2020-01-22 |
| Nir  | 215 | 100       | 400    | 2020-01-23 |
| Nir  | Nan | 100       | Nan    | 2020-01-24 |
| Nir  | Nan | 100       | Nan    | 2020-01-25 |
| Nir  | Nan | 100       | Nan    | 2020-01-26 |
| Nir  | Nan | 100       | Nan    | 2020-01-27 |
| Nir  | Nan | 100       | Nan    | 2020-01-28 |
| Nir  | Nan | 100       | Nan    | 2020-01-29 |
| Nir  | Nan | 100       | Nan    | 2020-01-30 |
| Nir  | Nan | 100       | Nan    | 2020-01-31 |
| Xyc  | 40  | 800       | 500    | 2020-01-01 |
| Xyc  | 40  | 100       | 500    | 2020-01-02 |
| Xyc  | 40  | 100       | 500    | 2020-01-03 |
| Xyc  | 40  | 100       | 500    | 2020-01-04 |
| ...  | ... | ...       | ...    | ...        |
| ...  | ... | ...       | ...    | ...        |
+------+-----+-----------+--------+------------+

I want to add an additional column named payout. The payout should be 0 unless the month-to-date sum of actual has passed the sum of predicted for the month.

For example, for Nir the sum of predicted is 4200, so the payout should be 0 until the sum of actual passes 4200. Once that threshold is passed, the payout should be 1% of (actual - predicted). With the above data, the output would look like this:

+------+-----+-----------+--------+---------------+--------+------------+
| NAME | ID  | PREDICTED | ACTUAL | MONTH_TO_DATE | PAYOUT | YYYY_MM_DD |
+------+-----+-----------+--------+---------------+--------+------------+
| Nir  | 215 | 100       | 400    | 400           | 0      | 2020-01-01 |
| Nir  | 215 | 200       | 400    | 800           | 0      | 2020-01-02 |
| Nir  | 215 | 100       | 400    | 1200          | 0      | 2020-01-03 |
| Nir  | 215 | 200       | 400    | 1600          | 0      | 2020-01-04 |
| Nir  | 215 | 100       | 400    | 2000          | 0      | 2020-01-05 |
| Nir  | 215 | 200       | 400    | 2400          | 0      | 2020-01-06 |
| Nir  | 215 | 100       | 400    | 2800          | 0      | 2020-01-07 |
| Nir  | 215 | 200       | 400    | 3200          | 0      | 2020-01-08 |
| Nir  | 215 | 100       | 400    | 3600          | 0      | 2020-01-09 |
| Nir  | 215 | 200       | 400    | 4000          | 0      | 2020-01-10 |
| Nir  | 215 | 100       | 400    | 4400          | 3      | 2020-01-11 |
| Nir  | 215 | 200       | 400    | ...           | 2      | 2020-01-12 |
| Nir  | 215 | 100       | 400    | ...           | 3      | 2020-01-13 |
| Nir  | 215 | 200       | 400    | ...           | 2      | 2020-01-14 |
| Nir  | 215 | 100       | 400    | ...           | 3      | 2020-01-15 |
| Nir  | 215 | 200       | 400    | ...           | 2      | 2020-01-16 |
| Nir  | 215 | 100       | 400    | ...           | 3      | 2020-01-17 |
| Nir  | 215 | 200       | 400    | ...           | 2      | 2020-01-18 |
| Nir  | 215 | 100       | 400    | ...           | 3      | 2020-01-19 |
| Nir  | 215 | 200       | 400    | ...           | 2      | 2020-01-20 |
| Nir  | 215 | 100       | 400    | ...           | 3      | 2020-01-21 |
| Nir  | 215 | 200       | 400    | ...           | 2      | 2020-01-22 |
| Nir  | 215 | 100       | 400    | ...           | 3      | 2020-01-23 |
| Nir  | Nan | 100       | Nan    |               |        | 2020-01-24 |
| Nir  | Nan | 100       | Nan    |               |        | 2020-01-25 |
| Nir  | Nan | 100       | Nan    |               |        | 2020-01-26 |
| Nir  | Nan | 100       | Nan    |               |        | 2020-01-27 |
| Nir  | Nan | 100       | Nan    |               |        | 2020-01-28 |
| Nir  | Nan | 100       | Nan    |               |        | 2020-01-29 |
| Nir  | Nan | 100       | Nan    |               |        | 2020-01-30 |
| Nir  | Nan | 100       | Nan    |               |        | 2020-01-31 |
| Xyc  | 40  | 800       | 500    | 500           | 0      | 2020-01-01 |
| Xyc  | 40  | 100       | 500    | 1000          | 0      | 2020-01-02 |
| Xyc  | 40  | 100       | 500    | 1500          | 4      | 2020-01-03 |
| Xyc  | 40  | 100       | 500    | 2000          | 4      | 2020-01-04 |
| ...  | ... | ...       | ...    |               |        | ...        |
| ...  | ... | ...       | ...    |               |        | ...        |
+------+-----+-----------+--------+---------------+--------+------------+

In the above output, Xyc has a total predicted of 2000, so payout should be 0 until the sum of actual also passes 2000. The real dataframe has daily data for ~70 names, so I feel a grouping may be needed.
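
To make the rule concrete, here is a small sketch that rebuilds Nir's January from the table above and finds the first day on which the running total of actual passes the monthly predicted total. The variable names (nir, month_total, month_to_date, first_payout_day) are illustrative only and do not come from the original post.

```python
import pandas as pd

# Nir's January as shown in the table: predicted alternates 100/200 through
# the 23rd and is 100 per day afterwards; actual is 400 per day through the
# 23rd and missing (NaN) for the remaining days.
days = pd.date_range("2020-01-01", "2020-01-31")
predicted = [100 if d.day % 2 == 1 or d.day > 23 else 200 for d in days]
actual = [400.0 if d.day <= 23 else float("nan") for d in days]
nir = pd.DataFrame({"PREDICTED": predicted, "ACTUAL": actual}, index=days)

month_total = nir["PREDICTED"].sum()      # predicted total for the whole month
month_to_date = nir["ACTUAL"].cumsum()    # running total of actual (NaN stays NaN)

# First day on which the running total exceeds the monthly predicted total.
first_payout_day = month_to_date[month_to_date > month_total].index[0]

print(month_total)               # 4200
print(first_payout_day.date())   # 2020-01-11
```

The monthly total is taken over all 31 rows, including the days that have no actual yet, which is why the threshold is 4200 rather than the sum of predicted so far.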

I tried:

new_sum = [df.actual.values[0]]
for i in range(1, len(df.index)):
    new_sum.append(new_sum[i-1] + df.actual.values[i])
df['actual_sum'] = new_sum

However, that simply gave me a running total of actual. I also tried this:

df['inc'] = df['actual'] - df['predicted']
df['payout'] = np.where(df['inc']>=1, (df['inc'] / 100) * 1, 0)

But the above doesn't check that the month-to-date total has reached the total for the month before applying the 1%.

Recommended answer

First, you need to remove the NaN rows from the data.

Here you go:

import pandas as pd
import numpy as np


df = pd.DataFrame({'Name':['Nir','Nir','Nir','Nir','Xyc','Xyc','Xyc'],'PREDICTED':[100,200,100,200,100,200,300],
                   'ACTUAL':[400,400,400,400,500,500,500],
                   'YYYY_MM_DD':['2020-01-01','2020-01-02','2020-01-03','2020-01-04','2020-01-01','2020-01-02','2020-01-03']})


def calculate(item):
    # select the rows for this name
    data = df[df['Name'] == item]
    # total predicted for this name, taken before dropping rows
    # so that days without an actual still count towards the total
    predicted_total = data['PREDICTED'].sum()

    # remove NaN rows (days that have no actual yet)
    data = data.dropna()

    # insert the running (month-to-date) total of ACTUAL
    data.insert(3, "MONTH_TO_DATE", data['ACTUAL'].cumsum(), True)

    # payout is 0 until the running total reaches the predicted total,
    # then 1% of (ACTUAL - PREDICTED), truncated to an integer
    conditions = [
        (data['MONTH_TO_DATE'] < predicted_total),
        (data['MONTH_TO_DATE'] >= predicted_total)
    ]
    choices = [0, ((data['ACTUAL'] - data['PREDICTED'])/100).astype(int)]
    data.insert(5, "PAYOUT", np.select(conditions, choices), True)

    return data


# collect results; pd.concat is used because DataFrame.append was removed in pandas 2.0
results = pd.concat([calculate(item) for item in df['Name'].unique()])
print(results)

Result:

  Name PREDICTED ACTUAL MONTH_TO_DATE  YYYY_MM_DD PAYOUT
0  Nir       100    400           400  2020-01-01      0
1  Nir       200    400           800  2020-01-02      2
2  Nir       100    400          1200  2020-01-03      3
3  Nir       200    400          1600  2020-01-04      2
4  Xyc       100    500           500  2020-01-01      0
5  Xyc       200    500          1000  2020-01-02      3
6  Xyc       300    500          1500  2020-01-03      2
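
Since the question asks for an efficient approach across ~70 names, the per-name loop could also be replaced by a single grouped computation. The following is only a sketch under a few assumptions that are not in the original answer: rows are sorted by date within each name, groups are formed per name and calendar month, and the threshold test is >= as in the code above (use > if "passes" should be strict). Unlike the answer, it keeps rows that have no actual yet, giving them a payout of 0, and it leaves the payout as a float instead of truncating it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Nir"] * 4 + ["Xyc"] * 3,
    "PREDICTED": [100, 200, 100, 200, 100, 200, 300],
    "ACTUAL": [400, 400, 400, 400, 500, 500, 500],
    "YYYY_MM_DD": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
                   "2020-01-01", "2020-01-02", "2020-01-03"],
})

# One group per name and calendar month (assumes rows are date-sorted per name).
month = pd.to_datetime(df["YYYY_MM_DD"]).dt.to_period("M")
groups = df.groupby(["Name", month])

# Monthly predicted total broadcast to every row, and running total of actual.
predicted_total = groups["PREDICTED"].transform("sum")
df["MONTH_TO_DATE"] = groups["ACTUAL"].cumsum()

# 1% of (ACTUAL - PREDICTED) once the running total reaches the monthly total,
# otherwise 0. Rows whose actual is NaN compare as False and so get 0.
reached = df["MONTH_TO_DATE"] >= predicted_total
df["PAYOUT"] = np.where(reached, (df["ACTUAL"] - df["PREDICTED"]) / 100, 0)

print(df)
```

On the sample data this gives the same MONTH_TO_DATE and PAYOUT values as the result shown above, without iterating over names in Python.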
