Pandas: merge DataFrames but sum overlapping columns
Problem description
I've been reading lots of posts about the merge() and join() methods of pandas DataFrames and trying them on my own problem, but I haven't quite found a solution.
I have a very large data file (.csv) containing the hourly consumption of something for various IDs. I want to aggregate the consumption for each ID over each month.
Due to memory limitations I need to process the hourly consumption file with read_csv in chunks (using the chunksize option), so I end up with many DataFrames, each holding the consumption of some IDs for some months, e.g.:
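For context, here is a minimal sketch of how such per-chunk frames might be produced. The CSV layout (the column names ID, Timestamp and Consumption), the tiny chunk size and the "YYYY-MM" month labels are assumptions made for illustration; the question does not show the real file.

```python
import io

import pandas as pd

# Hypothetical layout: one row per hourly reading (column names are assumptions).
csv_text = """ID,Timestamp,Consumption
XXX,2019-11-30 23:00,1.0
YYY,2019-11-10 05:00,3.0
YYY,2019-11-20 05:00,5.0
YYY,2019-12-01 07:00,8.0
"""

partials = []
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, parse_dates=["Timestamp"])
for chunk in reader:
    # Label each reading with its year and month, e.g. "2019-11".
    chunk["Month"] = chunk["Timestamp"].dt.strftime("%Y-%m")
    monthly = (
        chunk.groupby(["ID", "Month"])["Consumption"]
        .sum(min_count=1)   # an all-NaN month stays NaN instead of becoming 0
        .unstack("Month")   # IDs as rows, months as columns
    )
    partials.append(monthly)
```

With a chunk size of 2, the November readings for YYY are split across two chunks, which reproduces the overlap problem described in this question.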
df1 =
Month Dec Nov
ID
XXX 4.0 1.0
YYY 8.0 3.0
ZZZ 4.0 1.0
df2 =
Month Dec Nov Oct
ID
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
YYY 5.0 5.0 0.0
Source used to generate these frames for this post:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['XXX', 'XXX', 'YYY', 'YYY', 'ZZZ', 'ZZZ'],
                    'Month': ['Nov', 'Dec'] * 3,
                    'Consumption': [1.0, 4.0, 3.0, 8.0, 1.0, 4.0]})
df1 = df1.pivot(index='ID', columns='Month', values='Consumption')
df2 = pd.DataFrame({'ID': ['AAA', 'AAA', 'AAA', 'YYY', 'YYY', 'YYY', 'BBB', 'BBB', 'BBB'],
                    'Month': ['Oct', 'Nov', 'Dec'] * 3,
                    'Consumption': [9, 7, 1, 0, 5, 5, 2, np.nan, 0]})
df2 = df2.pivot(index='ID', columns='Month', values='Consumption')
Note that there is a difference between a consumption of 0.0 and NaN. 0.0 means there was at least one consumption reading of 0.0 in the month, whereas NaN means no consumption value was logged at all, in which case 0 cannot be assumed. For my purposes this distinction must be preserved.
Because the data file is processed in chunks, some IDs appear in more than one DataFrame, e.g. YYY, and for those IDs the months sometimes overlap too, e.g. Nov for ID YYY. In this case, consumption in the first half of the month is in df1 and the second half is in df2.
To aggregate the consumption I therefore need to merge these DataFrames by 'ID' and sum the values in overlapping months.
Straight summing of the DataFrames yields many NaNs:
df1 + df2 =
Month Dec Nov Oct
ID
AAA NaN NaN NaN
BBB NaN NaN NaN
XXX NaN NaN NaN
YYY 13.0 8.0 NaN
ZZZ NaN NaN NaN
I assume this is because summing an ID/month of df1 that doesn't appear in df2 (or vice versa) returns NaN.
An outer merge produces suffixed columns for the overlapping months:
df1.merge(df2,how='outer',on='ID') =
Month Dec_x Nov_x Dec_y Nov_y Oct
ID
XXX 4.0 1.0 NaN NaN NaN
YYY 8.0 3.0 5.0 5.0 0.0
ZZZ 4.0 1.0 NaN NaN NaN
AAA NaN NaN 1.0 7.0 9.0
BBB NaN NaN 0.0 NaN 2.0
I couldn't get combine_first to do what I want either.
What I want is something in between, which looks like this:
Month Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 13.0 8.0 0.0
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
Here overlapping months are summed such that x + NaN = x, NaN + y = y and NaN + NaN = NaN.
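These three rules match how DataFrame.add behaves when given fill_value=0: a value missing on only one side is treated as 0, while a cell missing on both sides stays missing. The sketch below is offered as a possible alternative for two frames, not as the method used elsewhere in this thread; note that it returns the IDs in sorted order rather than in order of appearance.

```python
import numpy as np
import pandas as pd

# Rebuild the two example frames from the question.
df1 = pd.DataFrame(
    {"Dec": [4.0, 8.0, 4.0], "Nov": [1.0, 3.0, 1.0]},
    index=pd.Index(["XXX", "YYY", "ZZZ"], name="ID"),
)
df2 = pd.DataFrame(
    {"Dec": [1.0, 0.0, 5.0], "Nov": [7.0, np.nan, 5.0], "Oct": [9.0, 2.0, 0.0]},
    index=pd.Index(["AAA", "BBB", "YYY"], name="ID"),
)

# fill_value=0 replaces a value missing on one side only;
# cells missing in both frames remain NaN.
result = df1.add(df2, fill_value=0)
print(result)
```

For many per-chunk frames this would have to be applied pairwise (for example in a loop that keeps a running total).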
One solution I can see is to do the merge, then sum the overlapping columns, ignoring NaNs:
df3 = df1.merge(df2, how='outer', on='ID', suffixes=('', '_x'))
overlapping_months_sufx = df3.columns[df3.columns.str.endswith('_x')]
for mnth_sufx in overlapping_months_sufx:
    mnth = mnth_sufx[:-2]  # strip the '_x' suffix to get the month name
    # Only touch rows where the suffixed column holds a value, so NaN + NaN stays NaN.
    has_value = df3[mnth_sufx].notnull()
    # Use .loc rather than chained indexing, which may assign to a copy.
    df3.loc[has_value, mnth] = df3[mnth].fillna(0) + df3[mnth_sufx]
    df3 = df3.drop(columns=mnth_sufx)
df3 =
Month Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 13.0 8.0 0.0
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
Given the size of this data set, it would be great to have the most efficient way to aggregate all this. Is there a better way to do it, perhaps in a single step?
Thanks, Chris
Recommended answer
Here's an attempt. Please leave a comment to let me know whether I understood correctly.
Given:
>>> df1
Month Dec Nov
ID
XXX 4.0 1.0
YYY 8.0 3.0
ZZZ 4.0 1.0
>>> df2
Month Dec Nov Oct
ID
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
YYY 5.0 5.0 0.0
Solution:
>>> pd.concat([df1, df2]).reset_index().groupby('ID', sort=False).sum(min_count=1)
Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 13.0 8.0 0.0
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
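Since the question involves many per-chunk frames rather than just two, the same idea can be applied to a whole list at once. The sketch below is a variant under stated assumptions: the third frame (extra) is made up for illustration, and groupby(level='ID') is used to group on the index directly, which makes the reset_index step unnecessary.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {"Dec": [4.0, 8.0, 4.0], "Nov": [1.0, 3.0, 1.0]},
    index=pd.Index(["XXX", "YYY", "ZZZ"], name="ID"),
)
df2 = pd.DataFrame(
    {"Dec": [1.0, 0.0, 5.0], "Nov": [7.0, np.nan, 5.0], "Oct": [9.0, 2.0, 0.0]},
    index=pd.Index(["AAA", "BBB", "YYY"], name="ID"),
)
# Hypothetical third chunk, added only to show that any number of frames works.
extra = pd.DataFrame({"Oct": [1.0]}, index=pd.Index(["BBB"], name="ID"))

frames = [df1, df2, extra]

# Stack all frames, then sum rows that share an ID; min_count=1 keeps
# all-NaN cells as NaN instead of turning them into 0.
total = pd.concat(frames).groupby(level="ID", sort=False).sum(min_count=1)
print(total)
```

Concatenating once and grouping once avoids repeated pairwise merges, although all partial frames must fit in memory at the same time.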
Explanation:
Concatenation simply puts df2 under df1.
>>> cat = pd.concat([df1, df2])
>>> cat
Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 8.0 3.0 NaN
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
YYY 5.0 5.0 0.0
reset_index moves the index into a column.
>>> cat = cat.reset_index()
>>> cat
ID Dec Nov Oct
0 XXX 4.0 1.0 NaN
1 YYY 8.0 3.0 NaN
2 ZZZ 4.0 1.0 NaN
3 AAA 1.0 7.0 9.0
4 BBB 0.0 NaN 2.0
5 YYY 5.0 5.0 0.0
I did this so that there is a column named 'ID' by which I can group the other values. groupby('ID', sort=False) creates groups of rows sharing the same value in the 'ID' column. sort=False keeps the groups in order of first appearance instead of sorting them by ID, which matches the row order of your desired output.
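To illustrate the effect of the sort parameter, here is a small sketch on a made-up frame (the values are placeholders chosen for this example):

```python
import pandas as pd

# Toy data: the IDs appear in the order XXX, YYY, AAA, YYY.
cat = pd.DataFrame(
    {"ID": ["XXX", "YYY", "AAA", "YYY"], "Dec": [4.0, 8.0, 1.0, 5.0]}
)

# Default (sort=True): group keys are sorted.
sorted_ids = list(cat.groupby("ID").sum(min_count=1).index)

# sort=False: group keys keep their order of first appearance.
unsorted_ids = list(cat.groupby("ID", sort=False).sum(min_count=1).index)

print(sorted_ids)    # ['AAA', 'XXX', 'YYY']
print(unsorted_ids)  # ['XXX', 'YYY', 'AAA']
```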
We can check the group sizes like this:
>>> cat.groupby('ID', sort=False).size()
ID
XXX 1
YYY 2
ZZZ 1
AAA 1
BBB 1
dtype: int64
As you can see, there is only one group of size two, because 'YYY' is the only duplicated ID.
sum(min_count=1) works like this: within each group, the values are summed column by column. The min_count=1 parameter requires at least one non-NaN value for a sum to be produced, so a group whose values in a column are all NaN sums to NaN rather than 0.
>>> cat.groupby('ID', sort=False).sum(min_count=1)
Dec Nov Oct
ID
XXX 4.0 1.0 NaN
YYY 13.0 8.0 0.0
ZZZ 4.0 1.0 NaN
AAA 1.0 7.0 9.0
BBB 0.0 NaN 2.0
Demonstration of min_count:
>>> s = pd.Series([np.nan, np.nan])
>>> s
0 NaN
1 NaN
dtype: float64
>>>
>>> s.sum()
0.0
>>> s.sum(min_count=1)
nan
>>> s[0] = 1
>>> s
0 1.0
1 NaN
dtype: float64
>>> s.sum()
1.0
>>> s.sum(min_count=1)
1.0
>>> s.sum(min_count=2)
nan