Using two dataframes to calculate a final value in pandas
Question
Currently, I have two dataframes that I am merging on 'KEY'. The first dataframe contains a KEY and the original price of a product. The second dataframe records each payment a person makes. I need to create a final calculated column in df1 that shows the remaining balance, which is calculated by subtracting payment_price from original_price. The only caveat is that only certain price codes reflect a payment (13, 14 and 15).
I'm not sure whether the best approach uses a merge or whether I can simply refer to the other df without merging (the latter seems preferable, since both dfs have 500,000,000+ rows), but I can't find much content on this specific scenario.
import pandas as pd

df1 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000034', '100000035', '100000036'],
                    'original_price': [1205.20, 1253.25, 1852.15, 1452.36, 1653.21],
                    'area': [12, 13, 12, 12, 12]})
df2 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000009', '100000009', '100000009', '100000034', '100000034', '100000034'],
                    'payment_price': [134.04, 453.43, 422.32, 23.23, 10.43, 10.47, 243.09, 23.45],
                    'Price_code': ['13', '13', '14', '15', '16', '13', '14', '15']})
df1:
KEY area original_price
0 100000555 12 1205.20
1 100000009 13 1253.25
2 100000034 12 1852.15
3 100000035 12 1452.36
4 100000036 12 1653.21
df2:
KEY payment_price Price_code
0 100000555 134.04 13
1 100000009 453.43 13
2 100000009 422.32 14
3 100000009 23.23 15
4 100000009 10.43 16
5 100000034 10.47 13
6 100000034 243.09 14
7 100000034 23.45 15
I need a calculation that subtracts every payment_price in df2 whose KEY matches and whose Price_code is 13, 14, or 15.
Final result
KEY area original_price calculated_price
0 100000555 12 1205.20 1071.16 # (1205.20 - 134.04)
1 100000009 13 1253.25 354.27 # (1253.25 - 453.43 - 422.32 - 23.23)
2 100000034 12 1852.15 1575.14 # (1852.15 - 10.47 - 243.09 - 23.45)
3 100000035 12 1452.36 1452.36
4 100000036 12 1653.21 1653.21
My initial inclination was to merge the two dfs and perform the calculation with a groupby statement. My hesitation is that this seems resource heavy, and the final df would have at least double the number of rows. Additionally, I am running into a mental block writing the calculation so that it only includes certain price codes, so now I'm wondering if there is a better approach. I'm open to other approaches or to help with this script. To be honest, I'm not entirely sure how to write the conditionals for the price codes in a case like this. The code below first merges the dfs, then creates a column (remaining_price). However, for KEY 100000009 I need to include only price codes 13, 14 and 15 and exclude 16, but 16 is currently included.
result = pd.merge(df1, df2,how='left', on='KEY')
codes = [13,14,15]
result['remaining_price'] = result['original_price'] - result['payment_price'].groupby(result['KEY']).transform('sum')
Finally, I assume that if I use this approach, I would need to drop all duplicate rows on KEY as well as the two merged columns (Price_code, payment_price).
result = result.drop_duplicates(subset=['KEY'],keep='first')
Recommended answer
Here is one way. There is no need for an explicit merge or to drop duplicates, which is where you might see a performance improvement.
Solution
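# Price_code holds strings in the sample data, so the filter must use strings too
s = df2[df2['Price_code'].isin(['13', '14', '15'])].groupby('KEY')['payment_price'].sum()
# Keys with no qualifying payment map to NaN; fillna(0) leaves their price unchanged
df1['calculated_price'] = df1['original_price'] - df1['KEY'].map(s).fillna(0)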
Result
KEY area original_price calculated_price
0 100000555 12 1205.20 1071.16
1 100000009 13 1253.25 354.27
2 100000034 12 1852.15 1575.14
3 100000035 12 1452.36 1452.36
4 100000036 12 1653.21 1653.21
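One detail worth checking before filtering: in the sample data above, Price_code is stored as strings ('13', '14', ...), so a membership test against integers matches no rows at all and no payments would be subtracted. The sketch below illustrates this on a small made-up frame (the three rows are an assumption for illustration, not the full data) and shows one possible workaround, normalizing the codes with pd.to_numeric first.

```python
import pandas as pd

# Small illustrative frame; Price_code is stored as strings, as in the question.
df2 = pd.DataFrame({
    'KEY': ['100000555', '100000009', '100000009'],
    'payment_price': [134.04, 453.43, 10.43],
    'Price_code': ['13', '14', '16'],
})

# Integer codes never compare equal to string values, so nothing matches.
int_matches = int(df2['Price_code'].isin([13, 14, 15]).sum())

# Converting the column to numbers makes the test independent of how it is stored.
numeric_codes = pd.to_numeric(df2['Price_code'], errors='coerce')
num_matches = int(numeric_codes.isin([13, 14, 15]).sum())

print(int_matches, num_matches)
```

If the real column is already numeric, the conversion should be a harmless no-op; if it is consistently string-typed, filtering against string codes directly avoids the extra pass over the data.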
Explanation
- Filter df2 by Price_code as required, group payment_price by KEY and sum. The result is a series mapping each KEY to its sum of payments.
- Use map to map these sums onto KEY in df1 and subtract them from original_price. A KEY with no qualifying payments maps to NaN, which fillna(0) turns into zero.
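The two steps above can be put together as a minimal end-to-end sketch on the question's sample data. The function name remaining_balance and its codes parameter are hypothetical, introduced only for illustration, and the sketch assumes Price_code is stored as strings, as in the sample.

```python
import pandas as pd

def remaining_balance(df1, df2, codes=('13', '14', '15')):
    """Return a copy of df1 with a calculated_price column (hypothetical helper)."""
    # Step 1: keep only payment rows with a qualifying code, then sum per KEY.
    paid = (
        df2.loc[df2['Price_code'].isin(list(codes))]
           .groupby('KEY')['payment_price']
           .sum()
    )
    # Step 2: look up each KEY's total; keys without payments become 0.
    out = df1.copy()
    out['calculated_price'] = out['original_price'] - out['KEY'].map(paid).fillna(0)
    return out

df1 = pd.DataFrame({
    'KEY': ['100000555', '100000009', '100000034', '100000035', '100000036'],
    'original_price': [1205.20, 1253.25, 1852.15, 1452.36, 1653.21],
    'area': [12, 13, 12, 12, 12],
})
df2 = pd.DataFrame({
    'KEY': ['100000555', '100000009', '100000009', '100000009',
            '100000009', '100000034', '100000034', '100000034'],
    'payment_price': [134.04, 453.43, 422.32, 23.23, 10.43, 10.47, 243.09, 23.45],
    'Price_code': ['13', '13', '14', '15', '16', '13', '14', '15'],
})

result = remaining_balance(df1, df2)
print(result.round(2))
```

Because the payments are reduced to one row per KEY before the lookup, the output keeps exactly the rows of df1, so no drop_duplicates step is needed afterwards.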