Pandas: merge data frame but summing overlapping columns


Problem description

I've been reading lots of posts about the merge() and join() methods of pandas.DataFrame, and trying these on my own problem, but haven't quite found a solution.

I have a very large data file (.csv) containing the hourly consumption of something for various IDs. I want to aggregate the consumption for each ID over each month.

Due to memory limitations I need to process the hourly consumption file with read_csv in chunks (using the chunksize option; a rough sketch of this chunked processing follows the reproduction code below), and I end up with a load of DataFrames of consumption for IDs for some months, e.g.:

df1 = 
 Month  Dec  Nov
ID             
XXX    4.0  1.0
YYY    8.0  3.0
ZZZ    4.0  1.0 

df2 = 
 Month  Dec  Nov  Oct
ID                  
AAA    1.0  7.0  9.0
BBB    0.0  NaN  2.0
YYY    5.0  5.0  0.0

Code to generate these for this post:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['XXX','XXX','YYY','YYY','ZZZ','ZZZ'],
                    'Month': ['Nov','Dec']*3,
                    'Consumption': [1.0,4.0,3.0,8.0,1.0,4.0]})
df1 = df1.pivot(index='ID', columns='Month', values='Consumption')
df2 = pd.DataFrame({'ID': ['AAA','AAA','AAA','YYY','YYY','YYY','BBB','BBB','BBB'],
                    'Month': ['Oct','Nov','Dec']*3,
                    'Consumption': [9,7,1,0,5,5,2,np.nan,0]})
df2 = df2.pivot(index='ID', columns='Month', values='Consumption')
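
For context, the chunked processing described at the top of the question might look roughly like the following sketch. The file name, the 'Timestamp' column and the per-chunk aggregation are assumptions for illustration, not part of the original post:

import pandas as pd

chunk_frames = []
# 'hourly_consumption.csv' and the 'Timestamp' column name are assumptions
for chunk in pd.read_csv('hourly_consumption.csv',
                         parse_dates=['Timestamp'],
                         chunksize=10**6):
    chunk['Month'] = chunk['Timestamp'].dt.strftime('%b')
    # collapse each chunk to an ID x Month table, like df1 and df2 above;
    # min_count=1 keeps an ID/month whose readings are all NaN as NaN, not 0
    monthly = (chunk.groupby(['ID', 'Month'])['Consumption']
                    .sum(min_count=1)
                    .unstack('Month'))
    chunk_frames.append(monthly)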

Note that there is a difference between a consumption of 0.0 and NaN. 0.0 means there was at least one consumption reading of 0.0 in the month, but NaN means there was no consumption value logged at all and, in this case, 0 cannot be assumed. For my purposes this difference must be kept distinct.

Because the data file is processed in chunks, there are some IDs that appear in more than one DataFrame, e.g. YYY, and, for those IDs, sometimes the months overlap too, e.g. Nov for ID YYY. In this case, consumption in the first half of the month is in df1 and the second half is in df2.

To aggregate the consumption I therefore need to merge these DataFrames by 'ID' and sum values in overlapping 'Months'.

Straight summing the DataFrames yields many NaNs:

df1 + df2 = 
 Month   Dec  Nov  Oct
ID                   
AAA     NaN  NaN  NaN
BBB     NaN  NaN  NaN
XXX     NaN  NaN  NaN
YYY    13.0  8.0  NaN
ZZZ     NaN  NaN  NaN

I assume this is because summing IDs/Months of df1 that don't appear in df2 returns NaN.

Outer merging produces the suffixed columns for overlapping months:

df1.merge(df2,how='outer',on='ID') = 
 Month  Dec_x  Nov_x  Dec_y  Nov_y  Oct
ID                                    
XXX      4.0    1.0    NaN    NaN  NaN
YYY      8.0    3.0    5.0    5.0  0.0
ZZZ      4.0    1.0    NaN    NaN  NaN
AAA      NaN    NaN    1.0    7.0  9.0
BBB      NaN    NaN    0.0    NaN  2.0

I couldn't get combine_first to do what I want either.
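
For reference, combine_first presumably falls short because it only fills NaNs from the other frame and never adds overlapping values; a quick check on the example frames above keeps df1's value for the overlap:

# combine_first keeps df1's non-NaN values and only fills gaps from df2,
# so the overlapping YYY/Nov cell stays 3.0 instead of the desired 8.0
check = df1.combine_first(df2)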

What I want is something in the middle, that looks like this:

 Month   Dec  Nov  Oct
ID                   
XXX      4.0  1.0  NaN
YYY     13.0  8.0  0.0
ZZZ      4.0  1.0  NaN
AAA      1.0  7.0  9.0
BBB      0.0  NaN  2.0

Where overlapping months are summed such that x + NaN = x, NaN + y = y and NaN + NaN = NaN.
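
(As a side note, this is the NaN-propagation behaviour that pandas' add with fill_value=0 is documented to give for aligned data; a minimal toy illustration with made-up values:)

import numpy as np
import pandas as pd

a = pd.Series([4.0, np.nan, np.nan])
b = pd.Series([np.nan, 5.0, np.nan])
# a missing value on one side is treated as 0, but the result stays NaN
# when both sides are missing: 4.0, 5.0, NaN
print(a.add(b, fill_value=0))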

One solution I can see is to do the merge, then sum the overlapping columns, ignoring NaNs:

df3 = df1.merge(df2, how='outer', on='ID', suffixes=['', '_x'])
# columns that picked up the '_x' suffix are the overlapping months
overlapping_months_sufx = df3.columns[df3.columns.str.endswith('_x')]
for mnth_sufx in overlapping_months_sufx:
    mnth = mnth_sufx[:-2]
    # only add where the suffixed column has a value, so NaN + NaN stays NaN
    mask = df3[mnth_sufx].notnull()
    df3.loc[mask, mnth] = df3.loc[mask, mnth].fillna(0) + df3.loc[mask, mnth_sufx]
    df3 = df3.drop(columns=mnth_sufx)
df3 = 
 Month   Dec  Nov  Oct
ID                   
XXX     4.0  1.0  NaN
YYY    13.0  8.0  0.0
ZZZ     4.0  1.0  NaN
AAA     1.0  7.0  9.0
BBB     0.0  NaN  2.0

Given the size of this data set it would be great to have the most efficient way to aggregate all this. Is there a better way to do it, perhaps in a single step?

Thanks, Chris

Recommended answer

Here's an attempt. Please leave a comment if I understood correctly.

Given:

>>> df1                                                                                                                
Month  Dec  Nov
ID             
XXX    4.0  1.0
YYY    8.0  3.0
ZZZ    4.0  1.0
>>> df2                                                                                                                
Month  Dec  Nov  Oct
ID                  
AAA    1.0  7.0  9.0
BBB    0.0  NaN  2.0
YYY    5.0  5.0  0.0

Solution:

>>> pd.concat([df1, df2]).reset_index().groupby('ID', sort=False).sum(min_count=1)
      Dec  Nov  Oct
ID                 
XXX   4.0  1.0  NaN
YYY  13.0  8.0  0.0
ZZZ   4.0  1.0  NaN
AAA   1.0  7.0  9.0
BBB   0.0  NaN  2.0


Explanation:

The concatenation simply stacks df2 underneath df1:

>>> cat = pd.concat([df1, df2])                                                                                        
>>> cat                                                                                                                
     Dec  Nov  Oct
ID                
XXX  4.0  1.0  NaN
YYY  8.0  3.0  NaN
ZZZ  4.0  1.0  NaN
AAA  1.0  7.0  9.0
BBB  0.0  NaN  2.0
YYY  5.0  5.0  0.0

reset_index moves the index into a regular column:

>>> cat = cat.reset_index()                                                                                            
>>> cat                                                                                                                
    ID  Dec  Nov  Oct
0  XXX  4.0  1.0  NaN
1  YYY  8.0  3.0  NaN
2  ZZZ  4.0  1.0  NaN
3  AAA  1.0  7.0  9.0
4  BBB  0.0  NaN  2.0
5  YYY  5.0  5.0  0.0

I did this such that there is a column with the name 'ID' by which I can group the other values. groupby('ID', sort=False) creates groups of rows sharing the same value in the 'ID' column (and sort=False keeps the groups in their order of first appearance so the final result matches your output).

We can check the group sizes like this:

>>> cat.groupby('ID', sort=False).size()                                                                               
ID
XXX    1
YYY    2
ZZZ    1
AAA    1
BBB    1
dtype: int64

As you can see we only have one group of size two because the 'YYY' ID is the only duplicated one.

sum(min_count=1) works like this: the values in each group are summed up with respect to their column. The parameter min_count=1 ensures that a series of all NaN values results in NaN when summed up.

>>> cat.groupby('ID', sort=False).sum(min_count=1)                                                      
      Dec  Nov  Oct
ID                 
XXX   4.0  1.0  NaN
YYY  13.0  8.0  0.0
ZZZ   4.0  1.0  NaN
AAA   1.0  7.0  9.0
BBB   0.0  NaN  2.0

Demonstration of min_count:

>>> s = pd.Series([np.nan, np.nan])                                                                                    
>>> s                                                                                                                  
0   NaN
1   NaN
dtype: float64
>>>                                                                                                                    
>>> s.sum()                                                                                                            
0.0
>>> s.sum(min_count=1)                                                                                                 
nan
>>> s[0] = 1                                                                                                           
>>> s                                                                                                                  
0    1.0
1    NaN
dtype: float64
>>> s.sum()                                                                                                            
1.0
>>> s.sum(min_count=1)                                                                                                 
1.0
>>> s.sum(min_count=2)                                                                                                 
nan
