pandas aggregate data from two data frames
Problem description
I have two pandas data frames, with some indexes and some column names in common (like partially overlapping time-series related to common quantities).
I need to merge these two data frames into a single one containing all the indexes and all the values for each index, keeping the values of the left (right) frame whenever an index-column combination appears in both.
Both the merge and join methods are unhelpful: merge duplicates information I don't need, and join causes the same problem.
What's an efficient method to obtain the result I need?
EDIT: For example, if I have the two data frames
df1 = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3],
    'C2': [2.1, 2.2, 2.3],
    'C3': [3.1, 3.2, 3.3]},
    index=['a', 'b', 'c'])

df2 = pd.DataFrame({
    'C3': [3.1, 3.2, 33.3],
    'C4': [4.1, 4.2, 4.3]},
    index=['b', 'c', 'd'])
What I need is a method that allows me to create:
merged = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3, np.nan],
    'C2': [2.1, 2.2, 2.3, np.nan],
    'C3': [3.1, 3.2, 3.3, 33.3],
    'C4': [np.nan, 4.1, 4.2, 4.3]},
    index=['a', 'b', 'c', 'd'])
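For this concrete case, combine_first alone already produces the desired frame, since it keeps the calling (left) frame's values wherever both frames have one. A minimal sketch on the two example frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3],
    'C2': [2.1, 2.2, 2.3],
    'C3': [3.1, 3.2, 3.3]},
    index=['a', 'b', 'c'])
df2 = pd.DataFrame({
    'C3': [3.1, 3.2, 33.3],
    'C4': [4.1, 4.2, 4.3]},
    index=['b', 'c', 'd'])

# Left values win on overlapping index/column pairs;
# everything else is filled in from df2.
merged = df1.combine_first(merged_other := df2)
```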
Here are three possibilities:

1. Use concat/groupby: First concatenate both DataFrames vertically. Then group by the index and select the first row in each group.
2. Use combine_first: Make a new index which is the union of df1 and df2. Reindex df1 using the new index. Then use combine_first to fill in NaNs with values from df2.
3. Use manual construction: We could use df2.index.difference(df1.index) to find exactly which rows need to be added to df1. So we could manually select those rows from df2 and concatenate them onto df1.
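On the example frames above, the first option boils down to a one-liner. A sketch; note that first() picks the first non-null value per group and column, which is what keeps the left frame's values on overlap:

```python
import pandas as pd

df1 = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3],
    'C2': [2.1, 2.2, 2.3],
    'C3': [3.1, 3.2, 3.3]},
    index=['a', 'b', 'c'])
df2 = pd.DataFrame({
    'C3': [3.1, 3.2, 33.3],
    'C4': [4.1, 4.2, 4.3]},
    index=['b', 'c', 'd'])

# Stack both frames, then keep the first non-null value per
# index label and column (df1's rows come first, so they win).
merged = pd.concat([df1, df2], sort=False).groupby(level=0).first()
```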
For small DataFrames, using_concat is faster. For larger DataFrames, using_combine_first appears to be slightly faster than the other options:
import numpy as np
import pandas as pd
import perfplot

def make_dfs(N):
    df1 = pd.DataFrame(np.random.randint(10, size=(N, 2)))
    df2 = pd.DataFrame(np.random.randint(10, size=(N, 2)), index=range(N//2, N//2 + N))
    return df1, df2

def using_concat(dfs):
    df1, df2 = dfs
    result = pd.concat([df1, df2], sort=False)
    n = result.index.nlevels
    return result.groupby(level=list(range(n))).first()

def using_combine_first(dfs):
    df1, df2 = dfs
    index = df1.index.union(df2.index)
    result = df1.reindex(index)
    result = result.combine_first(df2)
    return result

def using_manual_construction(dfs):
    df1, df2 = dfs
    index = df2.index.difference(df1.index)
    cols = df2.columns.difference(df1.columns)
    result = pd.concat([df1, df2.loc[index]], sort=False)
    result.loc[df2.index, cols] = df2
    return result

perfplot.show(
    setup=make_dfs,
    kernels=[using_concat, using_combine_first, using_manual_construction],
    n_range=[2**k for k in range(5, 21)],
    logx=True,
    logy=True,
    xlabel='len(df)')
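Independent of the timings, it is worth checking that the approaches agree. A quick sanity check on the benchmark's own random frames (dtypes are allowed to differ, since reindex/combine_first upcasts the integer columns to float):

```python
import numpy as np
import pandas as pd

def make_dfs(N):
    df1 = pd.DataFrame(np.random.randint(10, size=(N, 2)))
    df2 = pd.DataFrame(np.random.randint(10, size=(N, 2)), index=range(N//2, N//2 + N))
    return df1, df2

def using_concat(dfs):
    df1, df2 = dfs
    result = pd.concat([df1, df2], sort=False)
    return result.groupby(level=list(range(result.index.nlevels))).first()

def using_combine_first(dfs):
    df1, df2 = dfs
    return df1.reindex(df1.index.union(df2.index)).combine_first(df2)

dfs = make_dfs(64)
a = using_concat(dfs)
b = using_combine_first(dfs)

# check_dtype=False: using_concat keeps int64 while
# combine_first yields float64 after the reindex.
pd.testing.assert_frame_equal(a, b, check_dtype=False)
```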