pandas aggregate data from two data frames

Problem description

I have two pandas data frames, with some indexes and some column names in common (like partially overlapping time-series related to common quantities).

I need to merge these two data frames into a single one containing all the indexes and all the values for each index, keeping the values of the left (or right) one whenever an index-column combination appears in both data frames.

Neither merge nor join helps: merge duplicates information I don't need, and join causes the same problem.

What's an efficient method to obtain the result I need?

For example, given the two data frames

import pandas as pd

df1 = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3],
    'C2': [2.1, 2.2, 2.3],
    'C3': [3.1, 3.2, 3.3]},
    index=['a', 'b', 'c'])

df2 = pd.DataFrame({
    'C3': [3.1, 3.2, 33.3],
    'C4': [4.1, 4.2, 4.3]},
    index=['b', 'c', 'd'])

What I need is a method that allows me to create:

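# float('nan') marks missing values; the string 'nan' would turn the columns into object dtype
merged = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3, float('nan')],
    'C2': [2.1, 2.2, 2.3, float('nan')],
    'C3': [3.1, 3.2, 3.3, 33.3],
    'C4': [float('nan'), 4.1, 4.2, 4.3]},
    index=['a', 'b', 'c', 'd'])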

Solution

Here are three possibilities:

  • Use concat/groupby: First concatenate both DataFrames vertically. Then group by the index and select the first row in each group.

  • Use combine_first: Make a new index which is the union of the indexes of df1 and df2. Reindex df1 using the new index. Then use combine_first to fill in NaNs with values from df2.

  • Use manual construction: We could use df2.index.difference(df1.index) to find exactly which rows need to be added to df1. So we could manually select those rows from df2 and concatenate them on to df1.
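To make the combine_first option concrete, here is a minimal sketch applied to the sample frames from the question. It calls combine_first directly and skips the explicit reindex step, on the assumption that combine_first already returns the union of both indexes and both sets of columns. Swapping the caller and the argument gives the "keep the right values" variant.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"C1": [1.1, 1.2, 1.3], "C2": [2.1, 2.2, 2.3], "C3": [3.1, 3.2, 3.3]},
    index=["a", "b", "c"],
)
df2 = pd.DataFrame(
    {"C3": [3.1, 3.2, 33.3], "C4": [4.1, 4.2, 4.3]},
    index=["b", "c", "d"],
)

# Keep df1's values wherever both frames have a value; fill the gaps from df2.
keep_left = df1.combine_first(df2)

# Same idea with the roles swapped: df2's values win on overlapping cells.
keep_right = df2.combine_first(df1)

print(keep_left)
print(keep_right)
```

On overlapping cells such as ('b', 'C3'), keep_left holds the value from df1 and keep_right holds the value from df2. Rows and columns that exist in only one frame are carried over, with NaN where neither frame has a value.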

For small DataFrames, using_concat is faster. For larger DataFrames, using_combine_first appears to be slightly faster than the other options:

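import numpy as np
import pandas as pd
import perfplot

def make_dfs(N):
    # Two integer-indexed frames of N rows each; their indexes overlap on about half the rows.
    df1 = pd.DataFrame(np.random.randint(10, size=(N, 2)))
    df2 = pd.DataFrame(np.random.randint(10, size=(N, 2)), index=range(N//2, N//2 + N))
    return df1, df2

def using_concat(dfs):
    df1, df2 = dfs
    # Stack df1 on top of df2 so that df1's rows come first within each index group.
    result = pd.concat([df1, df2], sort=False)
    n = result.index.nlevels
    # Group on every index level; first() takes the first non-null value per column.
    return result.groupby(level=list(range(n))).first()

def using_combine_first(dfs):
    df1, df2 = dfs
    # Extend df1 to the union of both indexes, then fill its NaNs from df2.
    index = df1.index.union(df2.index)
    result = df1.reindex(index)
    result = result.combine_first(df2)
    return result

def using_manual_construction(dfs):
    df1, df2 = dfs
    # Rows and columns that exist only in df2.
    index = df2.index.difference(df1.index)
    cols = df2.columns.difference(df1.columns)
    # Append the df2-only rows to df1.
    result = pd.concat([df1, df2.loc[index]], sort=False)
    # Fill the df2-only columns for every row that df2 covers.
    result.loc[df2.index, cols] = df2
    return result

# Time the three kernels for N from 2**5 to 2**20 and plot the results on log-log axes.
perfplot.show(
    setup=make_dfs,
    kernels=[using_concat, using_combine_first,
             using_manual_construction],
    n_range=[2**k for k in range(5, 21)],
    logx=True,
    logy=True,
    xlabel='len(df)')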
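
The benchmark frames share the same columns, so they do not exercise the partially overlapping columns from the question. As a rough sanity check, the sketch below runs the concat/groupby approach on the question's sample frames. It uses level=0 because those frames have a single-level index, and it relies on GroupBy.first() returning the first non-null value in each column.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"C1": [1.1, 1.2, 1.3], "C2": [2.1, 2.2, 2.3], "C3": [3.1, 3.2, 3.3]},
    index=["a", "b", "c"],
)
df2 = pd.DataFrame(
    {"C3": [3.1, 3.2, 33.3], "C4": [4.1, 4.2, 4.3]},
    index=["b", "c", "d"],
)

# df1 is listed first, so its rows come first within each index group.
stacked = pd.concat([df1, df2], sort=False)

# For each index label, take the first non-null value in every column.
by_concat = stacked.groupby(level=0).first()

print(by_concat)
```

Because first() skips nulls, row 'b' takes C3 from df1 and C4 from df2, which matches the "keep the left values" behaviour asked for. Passing [df2, df1] to concat instead would make the right frame win.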
