Merge two dataframes with some common columns, where the common columns are combined with a custom function


Question

My question is very similar to "Merge pandas dataframe, with column operation", but that question doesn't answer my needs.

Let's say I have two dataframes such as the following (note that the dataframe content could be float numbers instead of booleans):

left = pd.DataFrame({0: [True, True, False], 0.5: [False, True, True]}, index=[12.5, 14, 15.5])
right = pd.DataFrame({0.7: [True, False, False], 0.5: [True, False, True]}, index=[12.5, 14, 15.5])

right:

        0.5    0.7
12.5   True   True
14.0  False  False
15.5   True  False

left:

        0.0    0.5
12.5   True  False
14.0   True   True
15.5  False   True

As you can see, they have the same index and one column in common. In real life there might be more common columns, such as one more at 1.0 or at other numbers not yet defined, and more unique columns on each side. I need to combine the two dataframes so that all unique columns are kept and the common columns are combined using a specific function, e.g. a boolean OR for this example. The indexes are always identical for both dataframes.

So the result should be:

        0.0   0.5    0.7
12.5   True  True   True
14.0   True  True  False
15.5  False  True  False

In real life there will be more than two dataframes to combine, but they can be combined sequentially, one after the other, starting from an empty first dataframe.

I feel pandas.combine might do the trick, but I can't figure it out from the documentation. Does anybody have a suggestion on how to do it in one or more steps?

Recommended answer

You can concatenate the dataframes and then group by the column names to apply an operation to the identically named columns. In this case you can get away with taking the sum and then casting back to bool, which gives the OR operation.

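import pandas as pd

# Place the frames side by side; duplicate column labels are kept.
df = pd.concat([left, right], axis=1)
# Sum the identically named columns, then cast back to bool (nonzero -> True).
# Note: groupby(..., axis=1) is deprecated in recent pandas releases;
# df.T.groupby(level=0).sum().T.astype(bool) is the transposed equivalent.
df.groupby(df.columns, axis=1).sum().astype(bool)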

Output:

        0.0   0.5    0.7
12.5   True  True   True
14.0   True  True  False
15.5  False  True  False


If you want to see how to do this in a less case-specific way, group by the columns again and then apply a function over axis=1:

df = pd.concat([left, right], axis=1)
# Within each group of identically named columns, OR the values row by row.
df.groupby(df.columns, axis=1).apply(lambda x: x.any(axis=1))
#        0.0   0.5    0.7
#12.5   True  True   True
#14.0   True  True  False
#15.5  False  True  False
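As a self-contained check of the boolean OR case, here is a minimal sketch that avoids `axis=1` in `groupby` by transposing first. It assumes a pandas version where grouping along columns is discouraged; the frames are the ones from the question, and `result` is just an illustrative name.

```python
import pandas as pd

idx = [12.5, 14, 15.5]
left = pd.DataFrame({0.0: [True, True, False], 0.5: [False, True, True]}, index=idx)
right = pd.DataFrame({0.7: [True, False, False], 0.5: [True, False, True]}, index=idx)

# Side-by-side concatenation keeps the duplicated 0.5 label.
df = pd.concat([left, right], axis=1)

# Transpose so column labels become row labels, group the rows by label,
# OR within each group, then transpose back to the original orientation.
result = df.T.groupby(level=0).any().T
print(result)
```

The transpose is cheap for frames of this size; for very wide frames it may be worth measuring before relying on it.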


Further, you can define a custom combining function. Here's one that adds twice the left frame's column to 4 times the right frame's column. If a column appears in only one frame, it returns 2x that column.

left:

      0.0  0.5
12.5    1   11
14.0    2   17
15.5    3   17

right:

      0.7  0.5
12.5    4    2
14.0    4   -1
15.5    5    5

Code:

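def my_func(x):
    # x holds the columns that share one label: one column if the label
    # is unique, two if it appears in both frames (left first, then right).
    try:
        res = x.iloc[:, 0]*2 + x.iloc[:, 1]*4
    except IndexError:
        # Only one column carries this label.
        res = x.iloc[:, 0]*2
    return res

df = pd.concat([left, right], axis=1)
df.groupby(df.columns, axis=1).apply(my_func)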

Output:

      0.0  0.5  0.7
12.5    2   30    8
14.0    4   30    8
15.5    6   54   10
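If unique columns should pass through unchanged, as the question asks, a more explicit alternative is to loop over the column labels. The sketch below is one possible way to do that, not part of the answer above: `combine_frames` is a hypothetical helper, it takes a two-argument function, and it assumes both frames share the same index and have no duplicate labels within a single frame.

```python
import pandas as pd

def combine_frames(left, right, func):
    """Combine two frames that share an index (hypothetical helper).

    Columns present in both frames become func(left_col, right_col);
    columns present in only one frame are copied through unchanged.
    """
    common = left.columns.intersection(right.columns)
    pieces = {}
    # union() returns the sorted set of all labels from both frames.
    for col in left.columns.union(right.columns):
        if col in common:
            pieces[col] = func(left[col], right[col])
        elif col in left.columns:
            pieces[col] = left[col]
        else:
            pieces[col] = right[col]
    return pd.DataFrame(pieces, index=left.index)

idx = [12.5, 14, 15.5]
left = pd.DataFrame({0.0: [1, 2, 3], 0.5: [11, 17, 17]}, index=idx)
right = pd.DataFrame({0.7: [4, 4, 5], 0.5: [2, -1, 5]}, index=idx)

# Same weighting as my_func for the shared column: 2 * left + 4 * right.
result = combine_frames(left, right, lambda a, b: 2 * a + 4 * b)
print(result)
```

For the boolean case in the question, passing `lambda a, b: a | b` as `func` gives the OR of the shared columns.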


Finally, if you want to do this consecutively, you should make use of reduce. Here I'll combine 5 DataFrames with the above function. (I'll just repeat the right frame 4x for the example.)

from functools import reduce

def my_comb(df_l, df_r, func):
    """ Concatenate df_l and df_r along axis=1. Apply the
    specified function.
    """
    df = pd.concat([df_l, df_r], axis=1)
    return df.groupby(df.columns, axis=1).apply(func)

reduce(lambda dfl, dfr: my_comb(dfl, dfr, func=my_func), [left, right, right, right, right])
#      0.0  0.5  0.7
#12.5   16  296  176
#14.0   32  212  176
#15.5   48  572  220
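The pairwise reduce is needed for my_func because 2x left + 4x right depends on the order and grouping of the frames. When the combining operation is associative and built into pandas (any, sum, max and so on), all frames can instead be concatenated in one go. The sketch below illustrates this with max; `combine_max` is a hypothetical helper name, and the transposed groupby is used here to avoid `axis=1`.

```python
from functools import reduce

import pandas as pd

idx = [12.5, 14, 15.5]
left = pd.DataFrame({0.0: [1, 2, 3], 0.5: [11, 17, 17]}, index=idx)
right = pd.DataFrame({0.7: [4, 4, 5], 0.5: [2, -1, 5]}, index=idx)
frames = [left, right, right]

def combine_max(frames):
    """Element-wise max over identically named columns (hypothetical helper)."""
    wide = pd.concat(frames, axis=1)
    # Group the transposed rows by their label, take the max, transpose back.
    return wide.T.groupby(level=0).max().T

# One concatenation over all frames ...
all_at_once = combine_max(frames)
# ... gives the same values as folding the frames pairwise.
step_by_step = reduce(lambda a, b: combine_max([a, b]), frames)
print(all_at_once)
```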
