How to merge two data frames based on nearest date

Question

I want to merge two data frames based on two columns: "Code" and "Date". Merging on "Code" is straightforward, but "Date" is tricky: there is no exact match between the dates in df1 and df2, so I want to select the closest dates. How can I do this?

df = df1[column_names1].merge(df2[column_names2], on='Code')


Recommended answer

I don't think there's a quick, one-line way to do this kind of thing, but I believe the best approach is to do it this way:


  1. Add a column to df1 with the closest date from the matching "Code" group in df2
  2. Call a standard merge on these
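The two steps can be sketched in brute-force form as follows. This is only an illustration under stated assumptions: the frames `left` and `right` hold made-up data, and `nearest_date` is a hypothetical helper, not part of the original answer. It scans every candidate date for each row, so it suits small inputs only.

```python
import pandas as pd

# Hypothetical toy data (not the frames from the question)
left = pd.DataFrame({'Code': [0, 0, 1],
                     'Date': pd.to_datetime(['2015-01-10', '2015-03-01', '2015-02-01']),
                     'val1': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Code': [0, 0, 1],
                      'Date': pd.to_datetime(['2015-01-01', '2015-02-25', '2015-06-01']),
                      'val2': [10.0, 20.0, 30.0]})

def nearest_date(code, date, candidates):
    """Return the date in `candidates` with the same Code that is closest to `date`."""
    dates = candidates.loc[candidates['Code'] == code, 'Date']
    gaps = (dates - date).abs()       # absolute time difference to every candidate
    return dates.loc[gaps.idxmin()]   # candidate with the smallest gap

# Step 1: add a column holding the closest date found in `right`
left_mod = left.copy()
left_mod['Date1'] = left_mod['Date']  # keep the original date for reference
left_mod['Date'] = [nearest_date(c, d, right)
                    for c, d in zip(left_mod['Code'], left_mod['Date1'])]

# Step 2: a standard merge, now that the keys match exactly
merged = left_mod.merge(right, on=['Code', 'Date'])
print(merged)
```

Every row of `left` is assumed to have at least one row in `right` with the same code; otherwise `idxmin` has nothing to choose from and raises.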

As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbors code for this sort of thing.
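To see why the lookup strategy matters, here is a sketch of a different technique from the scikit-learn one used in this answer: a binary search over sorted dates with NumPy's `searchsorted`, which avoids comparing every pair. The helper name `nearest_sorted` is hypothetical, and the sample dates are the "Code" 1 rows of the example data in this answer.

```python
import numpy as np

def nearest_sorted(sorted_vals, queries):
    """For each query, return the index of the nearest entry in `sorted_vals`.

    Assumes `sorted_vals` is sorted ascending and has at least two entries.
    """
    sorted_vals = np.asarray(sorted_vals)
    queries = np.asarray(queries)
    # Position where each query would be inserted to keep the array sorted
    right = np.searchsorted(sorted_vals, queries)
    # Clip so that both a left and a right neighbour always exist
    right = np.clip(right, 1, len(sorted_vals) - 1)
    left = right - 1
    # Pick whichever neighbour is closer; ties go to the left one
    pick_left = (queries - sorted_vals[left]) <= (sorted_vals[right] - queries)
    return np.where(pick_left, left, right)

dates = np.array(['2015-02-03', '2015-04-13', '2015-06-26'], dtype='datetime64[D]')
queries = np.array(['2015-04-06', '2015-05-09', '2015-06-08'], dtype='datetime64[D]')

idx = nearest_sorted(dates, queries)
print(dates[idx])
```

Each lookup costs O(log n) once the dates are sorted, whereas a pairwise comparison costs O(n) per row.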

I've put together one approach to that solution that should scale relatively well. First we can generate some simple data:

import pandas as pd
import numpy as np
dates = pd.date_range('2015', periods=200, freq='D')

rand = np.random.RandomState(42)
i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])

df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i1],
                    'val1':rand.rand(5)})
df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i2],
                    'val2':rand.rand(5)})

Let's take a look at them:

>>> df1
   Code       Date      val1
0     0 2015-01-16  0.975852
1     0 2015-01-31  0.516300
2     1 2015-04-06  0.322956
3     1 2015-05-09  0.795186
4     1 2015-06-08  0.270832

>>> df2
   Code       Date      val2
0     1 2015-02-03  0.184334
1     1 2015-04-13  0.080873
2     0 2015-05-02  0.428314
3     1 2015-06-26  0.688500
4     0 2015-06-30  0.058194

Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:

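from sklearn.neighbors import NearestNeighbors

def find_nearest(group, match, groupname):
    # Keep only the rows of `match` that share this group's code
    match_dates = match.loc[match[groupname] == group.name, 'Date'].values
    # n_neighbors is passed by keyword (recent scikit-learn rejects it positionally);
    # dates are cast to int64 so scikit-learn receives plain numeric input
    nbrs = NearestNeighbors(n_neighbors=1).fit(match_dates.astype('int64')[:, None])
    dist, ind = nbrs.kneighbors(group['Date'].values.astype('int64')[:, None])

    group = group.copy()                     # don't mutate the slice groupby passes in
    group['Date1'] = group['Date']           # keep the original date
    group['Date'] = match_dates[ind.ravel()] # nearest date found in df2
    return group

# group_keys=False keeps the original row index in the result
df1_mod = df1.groupby('Code', group_keys=False).apply(find_nearest, df2, 'Code')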
>>> df1_mod
   Code       Date      val1      Date1
0     0 2015-05-02  0.975852 2015-01-16
1     0 2015-05-02  0.516300 2015-01-31
2     1 2015-04-13  0.322956 2015-04-06
3     1 2015-04-13  0.795186 2015-05-09
4     1 2015-06-26  0.270832 2015-06-08

Finally, we can merge these together with a straightforward call to pd.merge:

>>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
   Code       Date      val1      Date1      val2
0     0 2015-05-02  0.975852 2015-01-16  0.428314
1     0 2015-05-02  0.516300 2015-01-31  0.428314
2     1 2015-04-13  0.322956 2015-04-06  0.080873
3     1 2015-04-13  0.795186 2015-05-09  0.080873
4     1 2015-06-26  0.270832 2015-06-08  0.688500

Notice that rows 0 and 1 both matched the same val2; this is expected given the way you described your desired solution.
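As a possible alternative, recent pandas versions provide `pd.merge_asof`, which can match on the nearest key within groups. The sketch below is not part of the original answer; it re-creates the example frames from their printed values and assumes a pandas version where `merge_asof` accepts `direction='nearest'`. The `MatchDate` column is a name introduced here for illustration.

```python
import pandas as pd

# The example frames from this answer, re-created from their printed values
df1 = pd.DataFrame({'Code': [0, 0, 1, 1, 1],
                    'Date': pd.to_datetime(['2015-01-16', '2015-01-31', '2015-04-06',
                                            '2015-05-09', '2015-06-08']),
                    'val1': [0.975852, 0.516300, 0.322956, 0.795186, 0.270832]})
df2 = pd.DataFrame({'Code': [1, 1, 0, 1, 0],
                    'Date': pd.to_datetime(['2015-02-03', '2015-04-13', '2015-05-02',
                                            '2015-06-26', '2015-06-30']),
                    'val2': [0.184334, 0.080873, 0.428314, 0.688500, 0.058194]})

# merge_asof requires both frames to be sorted by the `on` key
left = df1.sort_values('Date')
# The result keeps only the left frame's 'Date', so copy df2's date into a
# separate column (MatchDate, a hypothetical name) to see which date matched
right = df2.assign(MatchDate=df2['Date']).sort_values('Date')

merged = pd.merge_asof(left, right, on='Date', by='Code', direction='nearest')
print(merged)
```

Here `by='Code'` restricts candidates to rows with the same code, and `direction='nearest'` selects the closest date on either side. Unlike the approach above, the `Date` column in the result holds df1's original date, and the matched df2 date appears in `MatchDate`.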
