Merge two data-frames based on multiple conditions
Problem description
I want to compare two dataframes (df-a and df-b): for each ID and date in df-b, I need to find the row in df-a with the same ID whose date range contains that date. I then want to take all the columns of the matching df-a row and concatenate them onto the df-b row. For example:
If I have a dataframe df-a in the following format:
   ID     Start_Date  End_Date    A    B    C    D   E
0  cd2    2020-06-01  2020-06-24  'a'  'b'  'c'  10  20
1  cd2    2020-06-24  2020-07-21
2  cd56   2020-06-10  2020-07-03
3  cd915  2020-04-28  2020-07-21
4  cd103  2020-04-13  2020-04-24
and df-b in the following format:
   ID     Date
0  cd2    2020-05-12
1  cd2    2020-04-12
2  cd2    2020-06-10
3  cd15   2020-04-28
4  cd193  2020-04-13
I would like an output dataframe df-c like so:
   ID     Date        Start_Date  End_Date    A    B    C    D   E
0  cd2    2020-05-12  -           -           -    -    -    -   -
1  cd2    2020-04-12  -           -           -    -    -    -   -
2  cd2    2020-06-10  2020-06-01  2020-06-11  'a'  'b'  'c'  10  20
3  cd15   2020-04-28  -           -           -    -    -    -   -
4  cd193  2020-04-13  -           -           -    -    -    -   -
In a previous post I got a brilliant answer that let me compare the dataframes and drop rows wherever this condition was met, but I am struggling to figure out how to extract the matching information from df-a. My current attempt is below:
df_c = df_b.copy()
ar = []
for i in range(df_c.shape[0]):
    currentID = df_c.stafnum[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.stafnum == currentID]
    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j, :].Leave_Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j, :].Leave_End_Date
        if (startDate <= currentDate <= endDate):
            print(df_c.loc[i])
            print(df_a_entriesForCurrentID.iloc[j, :])
            #df_d=pd.concat([df_c.loc[i], df_a_entriesForCurrentID.iloc[j,:]], axis=0)
            #df_fin_2=df_fin.append(df_d, ignore_index=True)
            #ar.append(df_d)
Recommended answer
I noticed one issue in your question: where do the dates in the "Date" column of df-c come from? Also, IDs 'cd15' and 'cd193' are not in either dataframe.
For now, I used the following dataframes. Note that here df-a holds only the date ranges and df-b holds the same ranges plus columns A to E.
df-a:
   ID     Start_Date  End_Date
0  cd2    2020-06-01  2020-06-11
1  cd2    2020-06-24  2020-07-21
2  cd56   2020-06-10  2020-07-03
3  cd915  2020-04-28  2020-07-21
4  cd103  2020-04-13  2020-04-24
df-b:
   ID     Start_Date  End_Date    A  B  C  D   E
0  cd2    2020-06-01  2020-06-24  a  b  c  10  20
1  cd2    2020-06-24  2020-07-21  0  0  0  0   0
2  cd56   2020-06-10  2020-07-03  0  0  0  0   0
3  cd915  2020-04-28  2020-07-21  0  0  0  0   0
4  cd103  2020-04-13  2020-04-24  0  0  0  0   0
df-c (before any concatenation operations):
   ID     Date        Start_Date  End_Date  A    B    C    D    E
0  cd2    2020-05-12  0.0         0.0       0.0  0.0  0.0  0.0  0.0
1  cd2    2020-04-12  0.0         0.0       0.0  0.0  0.0  0.0  0.0
2  cd2    2020-06-10  0.0         0.0       0.0  0.0  0.0  0.0  0.0
3  cd15   2020-04-28  0.0         0.0       0.0  0.0  0.0  0.0  0.0
4  cd193  2020-04-13  0.0         0.0       0.0  0.0  0.0  0.0  0.0
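For anyone who wants to reproduce these three frames, here is a minimal sketch of how they could be built. Two things are assumptions on my part rather than something stated in the answer: the date columns are parsed with pd.to_datetime (the comparison and strftime calls in the loop need real datetime values), and the 0.0 placeholders in df_c are stored with object dtype so that strings can later be written into those cells.

```python
import pandas as pd

ids = ["cd2", "cd2", "cd56", "cd915", "cd103"]
starts = ["2020-06-01", "2020-06-24", "2020-06-10", "2020-04-28", "2020-04-13"]

# df_a: date ranges only (end dates as printed in the answer's df-a)
df_a = pd.DataFrame({
    "ID": ids,
    "Start_Date": pd.to_datetime(starts),
    "End_Date": pd.to_datetime(
        ["2020-06-11", "2020-07-21", "2020-07-03", "2020-07-21", "2020-04-24"]),
})

# df_b: the ranges plus the A-E columns (end dates as printed in the answer's df-b)
df_b = pd.DataFrame({
    "ID": ids,
    "Start_Date": pd.to_datetime(starts),
    "End_Date": pd.to_datetime(
        ["2020-06-24", "2020-07-21", "2020-07-03", "2020-07-21", "2020-04-24"]),
    "A": ["a", 0, 0, 0, 0],
    "B": ["b", 0, 0, 0, 0],
    "C": ["c", 0, 0, 0, 0],
    "D": [10, 0, 0, 0, 0],
    "E": [20, 0, 0, 0, 0],
})

# df_c: the IDs and dates to look up, plus placeholder columns
df_c = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd2", "cd15", "cd193"],
    "Date": pd.to_datetime(
        ["2020-05-12", "2020-04-12", "2020-06-10", "2020-04-28", "2020-04-13"]),
})
for col in ["Start_Date", "End_Date", "A", "B", "C", "D", "E"]:
    # Assumption: object dtype, so mixed strings and numbers can be assigned later
    df_c[col] = pd.Series(0.0, index=df_c.index, dtype=object)
```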
Here is the code that performs the conditional concatenation into df-c:
for i in range(df_c.shape[0]):
    currentID = df_c.ID[i]
    currentDate = df_c.Date[i]
    df_a_entriesForCurrentID = df_a.loc[df_a.ID == currentID]
    for j in range(df_a_entriesForCurrentID.shape[0]):
        startDate = df_a_entriesForCurrentID.iloc[j, :].Start_Date
        endDate = df_a_entriesForCurrentID.iloc[j, :].End_Date
        if (startDate <= currentDate <= endDate):
            # Get A-E column values for the particular entry.
            # Match on ID as well as Start_Date, and read the first match by
            # position (.iloc[0]); the label 0 only exists if the match
            # happens to be the first row of df_b.
            fullIDRow = df_b.loc[(df_b.ID == currentID) & (df_b.Start_Date == startDate)]
            aVal = fullIDRow.A.iloc[0]
            bVal = fullIDRow.B.iloc[0]
            cVal = fullIDRow.C.iloc[0]
            dVal = fullIDRow.D.iloc[0]
            eVal = fullIDRow.E.iloc[0]
            # Add all the values (including the start/end dates) to df_c
            df_c.at[i, 'Start_Date'] = startDate.strftime('%Y-%m-%d')
            df_c.at[i, 'End_Date'] = endDate.strftime('%Y-%m-%d')
            df_c.at[i, 'A'] = aVal
            df_c.at[i, 'B'] = bVal
            df_c.at[i, 'C'] = cVal
            df_c.at[i, 'D'] = dVal
            df_c.at[i, 'E'] = eVal
The output (df_c) is as follows:
   ID     Date        Start_Date  End_Date    A    B    C    D    E
0  cd2    2020-05-12  0.0         0.0         0.0  0.0  0.0  0.0  0.0
1  cd2    2020-04-12  0.0         0.0         0.0  0.0  0.0  0.0  0.0
2  cd2    2020-06-10  2020-06-01  2020-06-11  a    b    c    10   20
3  cd15   2020-04-28  0.0         0.0         0.0  0.0  0.0  0.0  0.0
4  cd193  2020-04-13  0.0         0.0         0.0  0.0  0.0  0.0  0.0
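As an alternative to the nested loops, the same lookup can probably be done with a merge followed by a range filter. This is only a sketch and not the approach used in the answer above. It uses the layout from the question (one frame with ranges and columns A to E, one frame with IDs and dates); the names ranges, lookups, candidates, matches and result are made up for illustration, and the first End_Date is taken as 2020-06-11 to agree with the expected output row.

```python
import pandas as pd

# Ranges with their payload columns (the question's df-a layout)
ranges = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd56", "cd915", "cd103"],
    "Start_Date": pd.to_datetime(
        ["2020-06-01", "2020-06-24", "2020-06-10", "2020-04-28", "2020-04-13"]),
    "End_Date": pd.to_datetime(
        ["2020-06-11", "2020-07-21", "2020-07-03", "2020-07-21", "2020-04-24"]),
    "A": ["a", None, None, None, None],
    "B": ["b", None, None, None, None],
    "C": ["c", None, None, None, None],
    "D": [10, None, None, None, None],
    "E": [20, None, None, None, None],
})

# IDs and dates to look up (the question's df-b layout)
lookups = pd.DataFrame({
    "ID": ["cd2", "cd2", "cd2", "cd15", "cd193"],
    "Date": pd.to_datetime(
        ["2020-05-12", "2020-04-12", "2020-06-10", "2020-04-28", "2020-04-13"]),
})

# Pair every lookup row with every range that shares its ID,
# keeping the original row position in the "index" column.
candidates = lookups.reset_index().merge(ranges, on="ID", how="inner")

# Keep only the pairs whose Date lies inside the range (both ends inclusive).
in_range = candidates["Date"].between(candidates["Start_Date"], candidates["End_Date"])
matches = candidates.loc[in_range].drop(columns=["ID", "Date"]).set_index("index")

# Left join on the row position: unmatched lookups keep NaN/NaT in the new columns.
result = lookups.join(matches, how="left")
print(result)
```

Unmatched rows end up with missing values instead of the 0.0 placeholders, which can be filled afterwards if needed. If one date falls inside several ranges for the same ID, the join produces one output row per match, so it may be worth de-duplicating matches first. The intermediate merge also grows with the number of ID pairs, which could matter for large frames.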