Merge two data-frames based on multiple conditions


Question

I want to compare two dataframes (df-a and df-b) and find the rows where an ID and date from one dataframe (df-b) fall within a date range for the same ID in the other dataframe (df-a). I then want to take all the columns of the matching df-a row and concatenate them onto the matching df-b row. For example:

Suppose I have a dataframe df-a in the following format:

    ID       Start_Date    End_Date     A   B   C   D   E 
0   cd2      2020-06-01    2020-06-24   'a' 'b' 'c' 10  20
1   cd2      2020-06-24    2020-07-21
2   cd56     2020-06-10    2020-07-03
3   cd915    2020-04-28    2020-07-21
4   cd103    2020-04-13    2020-04-24

and df-b:

    ID      Date
0   cd2     2020-05-12
1   cd2     2020-04-12
2   cd2     2020-06-10
3   cd15    2020-04-28
4   cd193   2020-04-13

I would like an output dataframe df-c like this:

    ID      Date        Start_Date  End_Date    A   B   C   D   E 
0   cd2     2020-05-12      -           -       -   -   -   -   -
1   cd2     2020-04-12      -           -       -   -   -   -   -
2   cd2     2020-06-10 2020-06-01 2020-06-11    'a' 'b' 'c' 10  20
3   cd15    2020-04-28      -           -       -   -   -   -   -
4   cd193   2020-04-13      -           -       -   -   -   -   -

In a previous post I got a brilliant answer that let me compare the dataframes and drop rows wherever this condition was met, but I am struggling to figure out how to extract the matching information from df-a. My current attempt is below:

df_c=df_b.copy()

ar=[]
for i in range(df_c.shape[0]):
    currentID = df_c.stafnum[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.stafnum == currentID]

    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j,:].Leave_Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j,:].Leave_End_Date

        if (startDate <= currentDate <= endDate):
            print(df_c.loc[i])
            print(df_a_entriesForCurrentID.iloc[j,:])
            
            #df_d=pd.concat([df_c.loc[i], df_a_entriesForCurrentID.iloc[j,:]], axis=0)
            
            #df_fin_2=df_fin.append(df_d, ignore_index=True)
            #ar.append(df_d)


Recommended answer

I noticed one issue in your question: where do the dates in the "Date" column of df-c come from? Also, IDs 'cd15' and 'cd193' are not in either dataframe.

For now, I used the following dataframes:
df-a:

    ID      Start_Date     End_Date
0   cd2     2020-06-01     2020-06-11
1   cd2     2020-06-24     2020-07-21
2   cd56    2020-06-10     2020-07-03
3   cd915   2020-04-28     2020-07-21
4   cd103   2020-04-13     2020-04-24

df-b:

    ID      Start_Date  End_Date    A   B   C   D   E
0   cd2     2020-06-01  2020-06-24  a   b   c   10  20
1   cd2     2020-06-24  2020-07-21  0   0   0   0   0
2   cd56    2020-06-10  2020-07-03  0   0   0   0   0
3   cd915   2020-04-28  2020-07-21  0   0   0   0   0
4   cd103   2020-04-13  2020-04-24  0   0   0   0   0

df-c (before any concatenation operations):

    ID     Date         Start_Date  End_Date    A    B    C    D    E
0   cd2    2020-05-12   0.0         0.0         0.0  0.0  0.0  0.0  0.0
1   cd2    2020-04-12   0.0         0.0         0.0  0.0  0.0  0.0  0.0
2   cd2    2020-06-10   0.0         0.0         0.0  0.0  0.0  0.0  0.0
3   cd15   2020-04-28   0.0         0.0         0.0  0.0  0.0  0.0  0.0
4   cd193  2020-04-13   0.0         0.0         0.0  0.0  0.0  0.0  0.0
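
To make the example reproducible, the three dataframes above could be built as follows. This is a minimal sketch, not part of the original answer: it assumes the date columns are parsed with `pd.to_datetime` and that the `0.0` placeholders in df-c are stored with object dtype so they can later be overwritten with strings.

```python
import pandas as pd

ids = ["cd2", "cd2", "cd56", "cd915", "cd103"]
starts = ["2020-06-01", "2020-06-24", "2020-06-10", "2020-04-28", "2020-04-13"]

# df-a: one row per ID and date range
df_a = pd.DataFrame({
    "ID": ids,
    "Start_Date": starts,
    "End_Date": ["2020-06-11", "2020-07-21", "2020-07-03", "2020-07-21", "2020-04-24"],
})

# df-b: the same keys plus the value columns A-E
df_b = pd.DataFrame({
    "ID": ids,
    "Start_Date": starts,
    "End_Date": ["2020-06-24", "2020-07-21", "2020-07-03", "2020-07-21", "2020-04-24"],
    "A": ["a", 0, 0, 0, 0],
    "B": ["b", 0, 0, 0, 0],
    "C": ["c", 0, 0, 0, 0],
    "D": [10, 0, 0, 0, 0],
    "E": [20, 0, 0, 0, 0],
})

# df-c: the IDs and dates to look up, with placeholder columns to fill in
df_c = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd2", "cd15", "cd193"],
    "Date": ["2020-05-12", "2020-04-12", "2020-06-10", "2020-04-28", "2020-04-13"],
})
for col in ["Start_Date", "End_Date", "A", "B", "C", "D", "E"]:
    # Assumption: object dtype, so strings can replace the 0.0 placeholders
    df_c[col] = pd.Series(0.0, index=df_c.index, dtype=object)

# Parse the date columns so that range comparisons work on real timestamps
for frame, cols in ((df_a, ["Start_Date", "End_Date"]),
                    (df_b, ["Start_Date", "End_Date"]),
                    (df_c, ["Date"])):
    for col in cols:
        frame[col] = pd.to_datetime(frame[col])
```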

The following code performs the conditional concatenation into df-c:

# Assumes df_a, df_b and df_c hold the tables shown above, with the date
# columns of df_a/df_b and df_c.Date parsed as datetimes (pd.to_datetime).
value_cols = ['A', 'B', 'C', 'D', 'E']

# Use object dtype so the 0.0 placeholders can be overwritten with strings
df_c = df_c.astype({col: object for col in ['Start_Date', 'End_Date'] + value_cols})

for i in range(df_c.shape[0]):
    currentID = df_c.ID[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.ID == currentID]

    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j].Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j].End_Date

        if startDate <= currentDate <= endDate:
            # Get the A-E values from the df_b row with the same ID and start date.
            # .iloc[0] selects by position, so it works whatever the row's index label is.
            fullIDRow = df_b.loc[(df_b.ID == currentID) & (df_b.Start_Date == startDate)].iloc[0]

            # Add all the values (including the start/end dates) to df_c
            df_c.at[i, 'Start_Date'] = startDate.strftime('%Y-%m-%d')
            df_c.at[i, 'End_Date'] = endDate.strftime('%Y-%m-%d')
            for col in value_cols:
                df_c.at[i, col] = fullIDRow[col]

The output (df_c) is as follows:

    ID      Date         Start_Date   End_Date     A     B    C    D    E
0   cd2     2020-05-12   0.0          0.0          0.0   0.0  0.0  0.0  0.0
1   cd2     2020-04-12   0.0          0.0          0.0   0.0  0.0  0.0  0.0
2   cd2     2020-06-10   2020-06-01   2020-06-11   a     b    c    10   20
3   cd15    2020-04-28   0.0          0.0          0.0   0.0  0.0  0.0  0.0
4   cd193   2020-04-13   0.0          0.0          0.0   0.0  0.0  0.0  0.0
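
The same kind of result can also be obtained without explicit loops. The following is a minimal sketch of an alternative, not the answerer's method. It assumes the layout from the question, where a single range table holds the ID, the date range and columns A to E. The names `ranges`, `lookups`, `candidates`, `in_range` and `result` are hypothetical. The idea is to merge on ID, keep only the pairs whose Date falls inside the range, and left-join the matches back, so unmatched rows end up with NaN/NaT rather than 0.0.

```python
import pandas as pd

# Range table: ID, date range and the value columns (values taken from the example)
ranges = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd56", "cd915", "cd103"],
    "Start_Date": pd.to_datetime(["2020-06-01", "2020-06-24", "2020-06-10",
                                  "2020-04-28", "2020-04-13"]),
    "End_Date": pd.to_datetime(["2020-06-11", "2020-07-21", "2020-07-03",
                                "2020-07-21", "2020-04-24"]),
    "A": ["a", 0, 0, 0, 0],
    "B": ["b", 0, 0, 0, 0],
    "C": ["c", 0, 0, 0, 0],
    "D": [10, 0, 0, 0, 0],
    "E": [20, 0, 0, 0, 0],
})

# Lookup table: the IDs and dates to test against the ranges
lookups = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd2", "cd15", "cd193"],
    "Date": pd.to_datetime(["2020-05-12", "2020-04-12", "2020-06-10",
                            "2020-04-28", "2020-04-13"]),
})

# Pair every lookup row with every range row that has the same ID;
# reset_index() keeps the original row position in an "index" column
candidates = lookups.reset_index().merge(ranges, on="ID", how="inner")

# Keep only the pairs whose Date lies inside [Start_Date, End_Date]
in_range = candidates[
    candidates["Date"].between(candidates["Start_Date"], candidates["End_Date"])
]

# If a Date falls inside several ranges for the same ID, keep the first match
in_range = in_range.drop_duplicates(subset="index").set_index("index")

# Left-join on the row position so lookups without a match are kept
result = lookups.join(in_range.drop(columns=["ID", "Date"]))
print(result)
```

The intermediate merge produces one row per lookup row and same-ID range, so for large tables with many ranges per ID it can use a lot of memory. For data of the size shown here that is not a concern.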
