Pandas:在一列上近似连接,在其他列上完全匹配 [英] Pandas: Approximate join on one column, exact match on other columns

查看:90
本文介绍了Pandas:在一列上近似连接,在其他列上完全匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个 Pandas 数据框,我想在多列(比如 3)和大约(即最近的邻居)上的一个(日期)列上加入/合并.我也想返回它们之间的差异(天).每个数据集大约有 50,000 行长.我对内连接最感兴趣,但剩菜"也很有趣,如果不是太难掌握的话.大多数完全匹配"观测值将在每个数据框中多次出现.

I have two pandas dataframes I want to join/merge exactly on a number of columns (say 3) and approximately, i.e nearest neighbour, on one (date) column. I also want to return the difference (days) between them. Each dataset is about 50,000 rows long. I'm most interested in an inner join, but the "leftovers" are also interesting if not too hard to get hold of. Most of the "exact match" observations will exist multiple times in each data frame.

我一直在尝试使用 difflib.get_close_matches 将它们全部连接为字符串(这很愚蠢,我知道!)但并不总是给出完全匹配.我想我需要先遍历精确匹配项,然后在该组中找到最近的匹配项,但我似乎无法正确...

I've been trying to use difflib.get_close_matches on the concatenation of all of them as strings (which is stupid, I know!) but is does not always give exact matches. I suppose I need to loop over the exact matches first and then find nearest matches within this group, but I just can't seem to get it right...

数据框看起来像:

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')

df1
Out[430]: 
       col1   col2 col3        date
index                              
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21

df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')

df2
Out[434]: 
      col1   col2 col3        date
index                             
b1     132    asd    1  2010-01-23
b2     432  dsa12    2  2016-05-23
b3     432  dsa12    2  2010-06-10
b4     123    sd2    3  2008-10-21

最后我想要的是:

       col1   col2 col3        date diff match_index
index                              
a1     1232    asd    1  2010-01-23  nan         nan
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3
a4      123   asd2    3  2008-10-21  nan         nan
a5      123    sd2    3  2008-10-21  nan          b4

或者如果仅使用内部连接更容易,我想要:

or if it's way easier with just an inner join I'd like:

       col1   col2 col3        date diff match_index
index                                                     
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3

推荐答案

我不确定这是否适合.它或多或少地实现了您想要的,但实际上并没有执行合并.它遵循与此问题相同的想法 除了不是仅基于一列对 df1 进行子集化,这里我们使用 groupby 匹配多个列,并在两个数据帧上进行匹配.如果您确实想明确包含 merge 命令并且对内部联接感到满意,请检查答案的最底部,它包含一个片段.

I am not sure if this suits. It achieves more or less what you want but does not actually perform a merge. It follows the same idea as this question except instead of subsetting the df1 based on only one column, here we match on multiple columns using a groupby and do it on both dataframes. IF you do want to explicitly include the merge command and are happy with an inner join then check the very bottom of the answer, it includes a snippet for that.

import pandas as pd
from sklearn.neighbors import NearestNeighbors

    def find_nearest(group, df2, groupname):
        try:
            match = df2.groupby(groupname).get_group(group.name)
            match['date'] = pd.to_datetime(match.date, unit = 'D')
            nbrs = NearestNeighbors(1).fit(match['date'].values[:, None])
            dist, ind = nbrs.kneighbors(group['date'].values[:, None])

            group['date1'] = group['date']
            group['date'] = match['date'].values[ind.ravel()]
            group['diff'] = (group['date1']-group['date'])
            group['match_index'] = match.index[ind.ravel()]
            return group
        except KeyError:
            return group

    #change dates from string to datetime
    df1['date'] = pd.to_datetime(df1.date, unit = 'D')
    df2['date'] = pd.to_datetime(df2.date, unit = 'D')

    #find closest dates and differences
    keys = ['col1', 'col2', 'col3']
    df1_mod = df1.groupby(keys).apply(find_nearest, df2, keys)

    #fill unmatched dates 
    df1_mod.date1.fillna(df1_mod.date, inplace=True)

    df2_mod = df2.groupby(keys).apply(find_nearest, df1, keys) 
    df2_mod.date1.fillna(df2_mod.date, inplace=True)

    #drop original column 
    df1_mod.drop('date', inplace=True, axis=1)
    df1_mod.rename(columns = {'date1':'date'}, inplace=True)

    df2_mod.drop('date', inplace=True, axis=1)
    df2_mod.rename(columns = {'date1':'date'}, inplace=True)
    df2_mod['diff'] = -df2_mod['diff']

    #drop redundant values
    df2_mod.drop(df2_mod[df2_mod.match_index.str.len()>0].index, inplace=True)

    #merge the two 
    df_final = pd.merge(df1_mod, df2_mod, how='outer')

结果如下:

In [349]: df_final
Out[349]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1   432  dsa12    2 2016-05-20 -3 days          b2
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN

使用合并命令:

In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:
  col1   col2 col3       date    diff match_index
0  432  dsa12    2 2016-05-20 -3 days          b2
2  432  dsa12    2 2010-06-20 10 days          b3

评论中考虑的案例,即:

The case considered in the comments, namely:

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','1432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')

产生以下结果:

In [351]: df_final
Out[351]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1  1432  dsa12    2 2016-05-20     NaT         NaN
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN

这篇关于Pandas:在一列上近似连接,在其他列上完全匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆