Pandas Merge on Name and Closest Date


Question

I am trying to merge two dataframes on both name and the closest date (with respect to the left-hand dataframe). In my research I found one similar question here, but it doesn't account for the name as well. From that question it doesn't seem like there is a way to do this with merge, but I can't see another way to do a two-key join that doesn't use the pandas merge function.

Is there a way to do this with merge? And if not, what would be the appropriate way to do this?

I will post a copy of what I have tried, but it uses an exact merge on date, which will not work. The most important line is the last one, where I build the data3 dataframe.

import pandas as pd

data = pd.read_csv("edgar14Afacts.csv", parse_dates={"dater": [2]}, infer_datetime_format=True)
data2 = pd.read_csv("sdcmergersdata.csv", parse_dates={"dater": [17]}, infer_datetime_format=True)
list(data2.columns.values)

# Strip stray line breaks from the column names
data2.rename(columns=lambda x: x.replace('\r\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\r', ''), inplace=True)
data2 = data2.rename(columns={'Acquiror Name': 'name'})
data2 = data2.rename(columns={'dater': 'date'})
data = data.rename(columns={'dater': 'date'})

list(data2.columns.values)

data["name"] = data['name'].map(str.lower)
data2["name"] = data2['name'].map(str.lower)
# Forward-fill missing dates (the result has to be assigned back to take effect)
data2['date'] = data2['date'].ffill()
# Keep copies of the original keys so they survive the merge
data['namer1'] = data['name']
data['dater1'] = data['date']
data2['namer2'] = data2['name']
data2['dater2'] = data2['date']

print(data.head())
print(data2.head())
# Match on the first four characters of the name only
data['name'] = data['name'].map(lambda x: str(x)[:4])
data2['name'] = data2['name'].map(lambda x: str(x)[:4])

# Exact merge on date and name; this is the step that needs to become a closest-date match
data3 = pd.merge(data, data2, how='left', on=['date', 'name'])
data3.to_csv("check.csv")

Answer

I'd also love to see the final solution you came up with to know how it shook out in the end.

One way to find the closest date is to compute the time difference between each date in the first DataFrame and every date in the second DataFrame. You can then use np.argmin to retrieve the date with the smallest time delta.

For example:

Setup

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import numpy as np
import pandas as pd
from io import StringIO

Data

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

Look at the data

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-02       29
2 2014-01-03        5
3 2014-01-04       73
4 2014-01-05       40

df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-08       29
2 2014-01-15        5
3 2014-01-22       73
4 2014-01-29       40

Func to find the closest date to a given date

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series with dates in it
    # calcs the delta between `timepoint` and each date in `time_series`
    # returns the closest date and optionally the number of days in its time delta
    deltas = (time_series - timepoint).abs()
    # np.argmin on the underlying array gives a position, so index with .iloc
    # (.ix has been removed from pandas)
    idx_closest_date = np.argmin(deltas.to_numpy())
    res = {"closest_date": time_series.iloc[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas.iloc[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)

df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
                                          find_closest_date, args=[df2.timepoint])
df1.head(10)

   timepoint  measure    closest  days_bt_x_and_y
0 2014-01-01       78 2014-01-01           0 days
1 2014-01-02       29 2014-01-01           1 days
2 2014-01-03        5 2014-01-01           2 days
3 2014-01-04       73 2014-01-01           3 days
4 2014-01-05       40 2014-01-08           3 days
5 2014-01-06       45 2014-01-08           2 days
6 2014-01-07       48 2014-01-08           1 days
7 2014-01-08        2 2014-01-08           0 days
8 2014-01-09       96 2014-01-08           1 days
9 2014-01-10       82 2014-01-08           2 days
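
The row-by-row apply above recomputes the deltas once per row. As a rough sketch of the same idea done in one step with NumPy broadcasting (this is not part of the original answer; `left` and `right` are small hypothetical stand-ins for df1 and df2, and the pairwise matrix needs len(left) x len(right) memory):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df1 (daily dates) and df2 (weekly dates).
left = pd.DataFrame({"timepoint": pd.date_range("2014-01-01", periods=10, freq="D")})
right = pd.DataFrame({"timepoint": pd.date_range("2014-01-01", periods=3, freq="7D")})

left_dates = left["timepoint"].to_numpy()
right_dates = right["timepoint"].to_numpy()

# Pairwise absolute differences, shape (len(left), len(right)).
deltas = np.abs(left_dates[:, None] - right_dates[None, :])

# Position of the smallest delta in each row; on a tie argmin keeps the first match.
pos = deltas.argmin(axis=1)

left["closest"] = right_dates[pos]
left["days_bt_x_and_y"] = deltas[np.arange(len(left)), pos]
print(left)
```

For the ten dates used here this should give the same closest and days_bt_x_and_y columns as the table above.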

Merge the two DataFrames on the new closest date column

df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])

colorder = [
    'timepoint_x',
    'closest',
    'timepoint_y',
    'days_bt_x_and_y',
    'measure_x',
    'measure_y'
]

df3 = df3.loc[:, colorder]
df3

   timepoint_x    closest timepoint_y  days_bt_x_and_y  measure_x  measure_y
0   2014-01-01 2014-01-01  2014-01-01           0 days         78         78
1   2014-01-02 2014-01-01  2014-01-01           1 days         29         78
2   2014-01-03 2014-01-01  2014-01-01           2 days          5         78
3   2014-01-04 2014-01-01  2014-01-01           3 days         73         78
4   2014-01-05 2014-01-08  2014-01-08           3 days         40         29
5   2014-01-06 2014-01-08  2014-01-08           2 days         45         29
6   2014-01-07 2014-01-08  2014-01-08           1 days         48         29
7   2014-01-08 2014-01-08  2014-01-08           0 days          2         29
8   2014-01-09 2014-01-08  2014-01-08           1 days         96         29
9   2014-01-10 2014-01-08  2014-01-08           2 days         82         29
10  2014-01-11 2014-01-08  2014-01-08           3 days         61         29
11  2014-01-12 2014-01-15  2014-01-15           3 days         68          5
12  2014-01-13 2014-01-15  2014-01-15           2 days          8          5
13  2014-01-14 2014-01-15  2014-01-15           1 days         94          5
14  2014-01-15 2014-01-15  2014-01-15           0 days         16          5
15  2014-01-16 2014-01-15  2014-01-15           1 days         31          5
16  2014-01-17 2014-01-15  2014-01-15           2 days         10          5
17  2014-01-18 2014-01-15  2014-01-15           3 days         34          5
18  2014-01-19 2014-01-22  2014-01-22           3 days         27         73
19  2014-01-20 2014-01-22  2014-01-22           2 days         58         73
20  2014-01-21 2014-01-22  2014-01-22           1 days         90         73
21  2014-01-22 2014-01-22  2014-01-22           0 days         41         73
22  2014-01-23 2014-01-22  2014-01-22           1 days         97         73
23  2014-01-24 2014-01-22  2014-01-22           2 days          7         73
24  2014-01-25 2014-01-22  2014-01-22           3 days         86         73
25  2014-01-26 2014-01-29  2014-01-29           3 days         62         40
26  2014-01-27 2014-01-29  2014-01-29           2 days         91         40
27  2014-01-28 2014-01-29  2014-01-29           1 days          0         40
28  2014-01-29 2014-01-29  2014-01-29           0 days         73         40
29  2014-01-30 2014-01-29  2014-01-29           1 days         22         40
30  2014-01-31 2014-01-29  2014-01-29           2 days         43         40
31  2014-02-01 2014-01-29  2014-01-29           3 days         87         40
32  2014-02-02 2014-02-05  2014-02-05           3 days         56         45
33  2014-02-03 2014-02-05  2014-02-05           2 days         45         45
34  2014-02-04 2014-02-05  2014-02-05           1 days         25         45
35  2014-02-05 2014-02-05  2014-02-05           0 days         92         45
36  2014-02-06 2014-02-05  2014-02-05           1 days         83         45
37  2014-02-07 2014-02-05  2014-02-05           2 days         13         45
38  2014-02-08 2014-02-05  2014-02-05           3 days         50         45
39  2014-02-09 2014-02-12  2014-02-12           3 days         48         48
40  2014-02-10 2014-02-12  2014-02-12           2 days         78         48
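
The example above matches on date only, while the question also asks for a match on name. One possible way to cover both, offered as a sketch rather than as the answer author's method, is pd.merge_asof with by= for the exact key and direction='nearest' for the date. The frames, column names and values below are invented for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical data: names, dates and values are illustrative only.
left = pd.DataFrame({
    "name": ["acme", "acme", "bolt"],
    "date": pd.to_datetime(["2014-01-03", "2014-01-20", "2014-01-06"]),
    "left_val": [1, 2, 3],
})
right = pd.DataFrame({
    "name": ["acme", "acme", "bolt", "bolt"],
    "date": pd.to_datetime(["2014-01-01", "2014-01-22", "2014-01-02", "2014-01-15"]),
    "right_val": [10, 20, 30, 40],
})

# Copy the right-hand date so the matched date stays visible after the merge.
right = right.assign(closest_date=right["date"])

# merge_asof requires both frames to be sorted by the `on` key.
merged = pd.merge_asof(
    left.sort_values("date"),
    right.sort_values("date"),
    on="date",            # nearest-match key
    by="name",            # exact-match key
    direction="nearest",  # look both backward and forward in time
)
print(merged)
```

Rows with no same-name counterpart on the right come back with NaN/NaT in the right-hand columns, like a left join. If matches should only count within a window, merge_asof also accepts a tolerance argument such as pd.Timedelta("7D").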
