Using two dataframes to calculate final value pandas

Problem description

I have two dataframes that I am merging on 'KEY'. The first dataframe contains a KEY and the original price of a product. The second dataframe records one row each time a person makes a payment. I need to create a calculated column in df1 that shows the remaining balance, which is original_price minus the sum of the matching payment_price values. The one caveat is that only certain Price_code values reflect a payment (13, 14 and 15).

I'm not sure whether the best approach uses a merge or whether I can simply refer to the other df without merging. The latter seems preferable, since both dfs have 500,000,000+ rows, but I can't find much content on this specific scenario.

import pandas as pd

df1 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000034', '100000035', '100000036'],
              'original_price': [1205.20, 1253.25, 1852.15, 1452.36, 1653.21],
              'area': [12, 13, 12, 12, 12]})
df2 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000009', '100000009', '100000009', '100000034', '100000034', '100000034'],
              'payment_price': [134.04, 453.43, 422.32, 23.23, 10.43, 10.47, 243.09, 23.45],
              'Price_code': ['13', '13', '14', '15', '16', '13', '14', '15']})

df1:

    KEY         area    original_price
0   100000555   12      1205.20
1   100000009   13      1253.25
2   100000034   12      1852.15
3   100000035   12      1452.36
4   100000036   12      1653.21

df2:

    KEY         payment_price    Price_code
0   100000555   134.04           13
1   100000009   453.43           13
2   100000009   422.32           14
3   100000009   23.23            15
4   100000009   10.43            16
5   100000034   10.47            13
6   100000034   243.09           14
7   100000034   23.45            15

I need a calculation that subtracts every payment_price in df2 from the original price when the row matches the KEY and has a Price_code of 13, 14 or 15.

Desired result

    KEY         area    original_price    calculated_price
0   100000555   12      1205.20           1071.16          # (1205.20 - 134.04)
1   100000009   13      1253.25           354.27           # (1253.25 - 453.43 - 422.32 - 23.23)
2   100000034   12      1852.15           1575.14          # (1852.15 - 10.47 - 243.09 - 23.45)
3   100000035   12      1452.36           1452.36
4   100000036   12      1653.21           1653.21

My initial inclination was to merge the two dfs and perform the calculation with a groupby. I hesitate because this seems resource heavy, and the merged df will have at least twice as many rows. I am also stuck on how to write the calculation so that it only includes certain price codes, so now I'm wondering if there is a better approach. I'm open to other approaches or to help with this script. To be honest, I'm not entirely sure how to write the conditionals for the price codes in a case like this. The code below first merges the dfs, then creates a column (remaining_price). However, for KEY 100000009 I need to include only Price_code values 13, 14 and 15 and exclude 16, but 16 is currently included.

result = pd.merge(df1, df2,how='left', on='KEY')

codes = [13,14,15]
result['remaining_price'] = result['original_price'] - result['payment_price'].groupby(result['KEY']).transform('sum')

Finally, I assume that if I use this approach, I would need to drop all duplicate rows on KEY as well as the two merged columns (Price_code, payment_price).

result = result.drop_duplicates(subset=['KEY'],keep='first')

Recommended answer

Here is one way. There is no need for an explicit merge or for dropping duplicates, which is where you might see a performance improvement.

Solution

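# Price_code holds strings in the sample df2, so filter on strings;
# use [13, 14, 15] instead if the real column holds integers.
s = df2[df2['Price_code'].isin(['13', '14', '15'])].groupby('KEY')['payment_price'].sum()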

df1['calculated_price'] = df1['original_price'] - df1['KEY'].map(s).fillna(0)

Result

         KEY  area  original_price  calculated_price
0  100000555    12         1205.20           1071.16
1  100000009    13         1253.25            354.27
2  100000034    12         1852.15           1575.14
3  100000035    12         1452.36           1452.36
4  100000036    12         1653.21           1653.21

Explanation

  • Filter df2 by Price_code as required, group payment_price by KEY, and sum. The result is a series mapping each KEY to its total payments.
  • Use map to look up these totals for each KEY in df1, fill the KEYs that have no payments with 0, and subtract from original_price.
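
The steps above can be sketched end to end on the sample data from the question. One assumption to flag: Price_code is stored as strings in the sample df2, so the filter list here uses strings too; with an integer column it would be [13, 14, 15]. The names payment_codes, is_payment and paid are illustrative only, and the final round(2) is added just to hide floating-point noise in the displayed values.

```python
import pandas as pd

df1 = pd.DataFrame({
    'KEY': ['100000555', '100000009', '100000034', '100000035', '100000036'],
    'original_price': [1205.20, 1253.25, 1852.15, 1452.36, 1653.21],
    'area': [12, 13, 12, 12, 12],
})
df2 = pd.DataFrame({
    'KEY': ['100000555', '100000009', '100000009', '100000009',
            '100000009', '100000034', '100000034', '100000034'],
    'payment_price': [134.04, 453.43, 422.32, 23.23, 10.43, 10.47, 243.09, 23.45],
    'Price_code': ['13', '13', '14', '15', '16', '13', '14', '15'],
})

# Assumption: codes are strings here because Price_code is a string column.
payment_codes = ['13', '14', '15']

# Step 1: keep only rows whose code counts as a payment, then total per KEY.
is_payment = df2['Price_code'].isin(payment_codes)
paid = df2.loc[is_payment].groupby('KEY')['payment_price'].sum()

# Step 2: look up each KEY's total in df1; KEYs with no payments get 0.
total_paid = df1['KEY'].map(paid).fillna(0)
df1['calculated_price'] = (df1['original_price'] - total_paid).round(2)

print(df1)
```

Because paid has one row per KEY, df1 keeps its original row count and no duplicate rows need to be dropped afterwards. If the filter list and the column have different dtypes (integers against strings), isin matches nothing and calculated_price silently equals original_price, so it is worth checking df2['Price_code'].dtype first.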
