Pandas combine two dataframes based on time difference

Question

I have two data frames that store different types of medical information of patients. The common elements of both the data frames are the encounter ID (hadm_id) and the time the information was recorded ((n|c)e_charttime).

One data frame (df_str) contains structured information such as vital signs and lab test values and values derived from these (such as change statistics over 24 hours). The other data frame (df_notes) contains a column with a clinical note recorded at a specified time for an encounter. Both these data frames contain multiple encounters, but the common element is the encounter ID (hadm_id).

Here are examples of the data frames for ONE encounter ID (hadm_id) with a subset of variables:

df_str
    hadm_id ce_charttime    hr  resp    magnesium   hr_24hr_mean
0   196673  2108-03-05 15:34:00 95.0    12.0    NaN 95.000000
1   196673  2108-03-05 16:00:00 85.0    11.0    NaN 90.000000
2   196673  2108-03-05 16:16:00 85.0    11.0    1.8 88.333333
3   196673  2108-03-05 17:00:00 109.0   12.0    1.8 93.500000
4   196673  2108-03-05 18:00:00 97.0    12.0    1.8 94.200000
5   196673  2108-03-05 19:00:00 99.0    16.0    1.8 95.000000
6   196673  2108-03-05 20:00:00 98.0    13.0    1.8 95.428571
7   196673  2108-03-05 21:00:00 97.0    14.0    1.8 95.625000
8   196673  2108-03-05 22:00:00 101.0   12.0    1.8 96.222222
9   196673  2108-03-05 23:00:00 97.0    13.0    1.8 96.300000
10  196673  2108-03-06 00:00:00 93.0    13.0    1.8 96.000000
11  196673  2108-03-06 01:00:00 89.0    12.0    1.8 95.416667
12  196673  2108-03-06 02:00:00 88.0    10.0    1.8 94.846154
13  196673  2108-03-06 03:00:00 87.0    12.0    1.8 94.285714
14  196673  2108-03-06 04:00:00 97.0    19.0    1.8 94.466667
15  196673  2108-03-06 05:00:00 95.0    11.0    1.8 94.500000
16  196673  2108-03-06 05:43:00 95.0    11.0    2.0 94.529412
17  196673  2108-03-06 06:00:00 103.0   17.0    2.0 95.000000
18  196673  2108-03-06 07:00:00 101.0   12.0    2.0 95.315789
19  196673  2108-03-06 08:00:00 103.0   20.0    2.0 95.700000
20  196673  2108-03-06 09:00:00 84.0    11.0    2.0 95.142857
21  196673  2108-03-06 10:00:00 89.0    11.0    2.0 94.863636
22  196673  2108-03-06 11:00:00 91.0    14.0    2.0 94.695652
23  196673  2108-03-06 12:00:00 85.0    10.0    2.0 94.291667
24  196673  2108-03-06 13:00:00 98.0    14.0    2.0 94.440000
25  196673  2108-03-06 14:00:00 100.0   18.0    2.0 94.653846
26  196673  2108-03-06 15:00:00 95.0    12.0    2.0 94.666667
27  196673  2108-03-06 16:00:00 96.0    20.0    2.0 95.076923
28  196673  2108-03-06 17:00:00 106.0   21.0    2.0 95.360000

df_notes
    hadm_id ne_charttime    note
0   196673  2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ...
1   196673  2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\...
2   196673  2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\...
3   196673  2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (...
4   196673  2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n...
5   196673  2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain...
6   196673  2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (...
7   196673  2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain...
8   196673  2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (...
9   196673  2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:...
10  196673  2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON...
11  196673  2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
12  196673  2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*...
13  196673  2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
14  196673  2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5...

What I want to do is to combine both the data frames based on the time when that information was recorded. More specifically, for each row in df_notes, I want a corresponding row from df_str with ce_charttime <= ne_charttime.

As an example, the first row in df_notes has ne_charttime = 2108-03-05 16:54:00. There are three rows in df_str with record times less than this time: ce_charttime = 2108-03-05 15:34:00, ce_charttime = 2108-03-05 16:00:00, ce_charttime = 2108-03-05 16:16:00. The most recent of these is the row with ce_charttime = 2108-03-05 16:16:00. So in my resulting data frame, for ne_charttime = 2108-03-05 16:54:00, I will have hr = 85.0, resp = 11.0, magnesium = 1.8, hr_24hr_mean = 88.33.
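
For a single note timestamp, the lookup I have in mind can be written directly; the following is just a sketch to illustrate the matching rule (assuming ce_charttime is already a datetime column and df_str is sorted by it):

import pandas as pd

ne = pd.Timestamp('2108-03-05 16:54:00')
earlier = df_str[df_str['ce_charttime'] <= ne]            # all readings at or before the note time
match = earlier.iloc[-1] if not earlier.empty else None   # most recent of those, if any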

Essentially, in this example the resulting data frame will look like this:

    hadm_id ne_charttime    note    hr  resp    magnesium   hr_24hr_mean
0   196673  2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ...   85.0    11.0    1.8 88.333333
1   196673  2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note\...   109.0   12.0    1.8 93.500000
2   196673  2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note\...   97.0    12.0    1.8 94.200000
3   196673  2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (...   103.0   17.0    2.0 95.000000
4   196673  2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n...   103.0   20.0    2.0 95.700000
5   196673  2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain...   85.0    10.0    2.0 94.291667
6   196673  2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (...   98.0    14.0    2.0 94.440000
7   196673  2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain...   106.0   21.0    2.0 95.360000
8   196673  2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (...   NaN NaN NaN NaN
9   196673  2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:...   NaN NaN NaN NaN
10  196673  2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON...   NaN NaN NaN NaN
11  196673  2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...   NaN NaN NaN NaN
12  196673  2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*...   NaN NaN NaN NaN
13  196673  2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...   NaN NaN NaN NaN
14  196673  2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5...   NaN NaN NaN NaN

The resulting data frame will be of the same length as df_notes. I have been able to come up with a very inefficient piece of code using for loops and explicit indexing to get this result:

import numpy as np

cols = list(df_str.columns[2:])  # structured-variable columns (everything except hadm_id and ce_charttime)

final_df = df_notes.copy()
for col in cols:
  final_df[col] = np.nan  # initialise the structured columns as missing

idx = 0  # counts the note rows that found a later structured reading
for i, note_row in final_df.iterrows():
  ne = note_row['ne_charttime']
  for j, str_row in df_str.iterrows():
    ce = str_row['ce_charttime']
    if ne < ce:  # first structured reading recorded after this note
      idx += 1
      # copy the values from the previous structured row
      for col in cols:
        final_df.iloc[i, final_df.columns.get_loc(col)] = df_str.iloc[j-1][col]
      break

# use the last structured row for the note row at position idx
for col in cols:
  final_df.iloc[idx, final_df.columns.get_loc(col)] = df_str.iloc[-1][col]

This piece of code is bad because it is very inefficient and while it may work for this example, in my example dataset, I have over 30 different columns of structured variables, and over 10,000 encounters.

EDIT-2: @Stef has provided an excellent answer which seems to work and replace my elaborate loopy code with a single line (amazing). However, while that works for this particular example, I am running into problems when I apply it to a bigger subset which includes multiple encounters. For example, consider the following example:

df_str.shape, df_notes.shape
((217, 386), (35, 4))

df_notes[['hadm_id', 'ne_charttime']]
    hadm_id ne_charttime
0   100104  2201-06-21 20:00:00
1   100104  2201-06-21 22:51:00
2   100104  2201-06-22 05:00:00
3   100104  2201-06-23 04:33:00
4   100104  2201-06-23 12:59:00
5   100104  2201-06-24 05:15:00
6   100372  2115-12-20 02:29:00
7   100372  2115-12-21 10:15:00
8   100372  2115-12-22 13:05:00
9   100372  2115-12-25 17:16:00
10  100372  2115-12-30 10:58:00
11  100372  2115-12-30 13:07:00
12  100372  2115-12-30 14:16:00
13  100372  2115-12-30 22:34:00
14  100372  2116-01-03 09:10:00
15  100372  2116-01-07 11:08:00
16  100975  2126-03-02 06:06:00
17  100975  2126-03-02 17:44:00
18  100975  2126-03-03 05:36:00
19  100975  2126-03-03 18:27:00
20  100975  2126-03-04 05:29:00
21  100975  2126-03-04 10:48:00
22  100975  2126-03-04 16:42:00
23  100975  2126-03-05 22:12:00
24  100975  2126-03-05 23:01:00
25  100975  2126-03-06 11:02:00
26  100975  2126-03-06 13:38:00
27  100975  2126-03-08 13:39:00
28  100975  2126-03-11 10:41:00
29  101511  2199-04-30 09:29:00
30  101511  2199-04-30 09:53:00
31  101511  2199-04-30 18:06:00
32  101511  2199-05-01 08:28:00
33  111073  2195-05-01 01:56:00
34  111073  2195-05-01 21:49:00

This example has 5 encounters. The data frame is sorted by hadm_id and, within each hadm_id, ne_charttime is sorted. However, the column ne_charttime by itself is NOT sorted, as can be seen from row 0 (ne_charttime=2201-06-21 20:00:00) and row 6 (ne_charttime=2115-12-20 02:29:00). When I try to do a merge_asof, I get the following error:

ValueError: left keys must be sorted. Is this because the ne_charttime column is not sorted? If so, how do I rectify this while maintaining the integrity of the encounter ID groups?

EDIT-1: I was able to loop over the encounters as well:

import numpy as np
import pandas as pd

cols = list(dev_str.columns[1:]) # get the cols to merge (everything except hadm_id)
final_dfs = [] 

grouped = dev_notes.groupby('hadm_id') # get groups of encounter ids
for name, group in grouped:
  final_df = group.copy().reset_index(drop=True) # make a copy of notes for that encounter
  for col in cols:
    final_df[col] = np.nan # set the values to nan

  idx = 0 # index to track the final row in the given encounter
  for i, note_row in final_df.iterrows():
    ne = note_row['ne_charttime']
    sub = dev_str.loc[(dev_str['hadm_id'] == name)].reset_index(drop=True) # get the df corresponding to the ecounter
    for j, str_row in sub.iterrows():
      ce = str_row['ce_charttime']
      if ne < ce: # if the variable charttime < note charttime
        idx += 1

        # grab the previous values for the variables and break
        for col in cols:
          final_df.iloc[i, final_df.columns.get_loc(col)] = sub.iloc[j-1][col]          
        break               

  # get the last value in the df for the variables
  for col in cols:
    final_df.iloc[idx, final_df.columns.get_loc(col)] = sub.iloc[-1][col]

  final_dfs.append(final_df) # append the df to the list

# cat the list to get final df and reset index
final_df = pd.concat(final_dfs)
final_df.reset_index(inplace=True, drop=True)

Again, this is very inefficient, but it does the job.

Is there a better way to achieve what I want? Any help is appreciated.

Thanks.

Answer

You can use merge_asof (both dataframes must be sorted by the columns you're merging them on, which is already the case in your example):

final_df = pd.merge_asof(df_notes, df_str, left_on='ne_charttime', right_on='ce_charttime', by='hadm_id')

Result:

    hadm_id        ne_charttime                                               note        ce_charttime     hr  resp  magnesium  hr_24hr_mean
0    196673 2108-03-05 16:54:00  Nursing\nNursing Progress Note\nPt is a 43 yo ... 2108-03-05 16:16:00   85.0  11.0        1.8     88.333333
1    196673 2108-03-05 17:54:00  Physician \nPhysician Resident Admission Note\... 2108-03-05 17:00:00  109.0  12.0        1.8     93.500000
2    196673 2108-03-05 18:09:00  Physician \nPhysician Resident Admission Note\... 2108-03-05 18:00:00   97.0  12.0        1.8     94.200000
3    196673 2108-03-06 06:11:00  Nursing\nNursing Progress Note\nPain control (... 2108-03-06 06:00:00  103.0  17.0        2.0     95.000000
4    196673 2108-03-06 08:06:00  Physician \nPhysician Resident Progress Note\n... 2108-03-06 08:00:00  103.0  20.0        2.0     95.700000
5    196673 2108-03-06 12:40:00  Nursing\nNursing Progress Note\nChief Complain... 2108-03-06 12:00:00   85.0  10.0        2.0     94.291667
6    196673 2108-03-06 13:01:00  Nursing\nNursing Progress Note\nPain control (... 2108-03-06 13:00:00   98.0  14.0        2.0     94.440000
7    196673 2108-03-06 17:09:00  Nursing\nNursing Transfer Note\nChief Complain... 2108-03-06 17:00:00  106.0  21.0        2.0     95.360000
8    196673 2108-03-06 17:12:00  Nursing\nNursing Transfer Note\nPain control (... 2108-03-06 17:00:00  106.0  21.0        2.0     95.360000
9    196673 2108-03-07 15:25:00  Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... 2108-03-06 17:00:00  106.0  21.0        2.0     95.360000
10   196673 2108-03-07 18:34:00  Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... 2108-03-06 17:00:00  106.0  21.0        2.0     95.360000
11   196673 2108-03-09 09:10:00  Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00  106.0  21.0        2.0     95.360000
12   196673 2108-03-09 12:22:00  Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... 2108-03-06 17:00:00  106.0  21.0        2.0     95.360000
13   196673 2108-03-10 05:26:00  Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00  106.0  21.0        2.0     95.360000
14   196673 2108-03-10 05:27:00  Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... 2108-03-06 17:00:00  106.0  21.0        2.0     95.360000
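
Note that merge_asof also needs the merge keys to be actual datetimes (or another sortable numeric type). If ce_charttime/ne_charttime were read from a CSV as plain strings, a small preparation step along these lines would be needed first (a sketch, assuming the column names above):

import pandas as pd

# Assumption: the chart-time columns may still be object/string dtype after reading a CSV.
df_str['ce_charttime'] = pd.to_datetime(df_str['ce_charttime'])
df_notes['ne_charttime'] = pd.to_datetime(df_notes['ne_charttime'])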

PS: This gives you the correct result for all rows. There's a logical flaw in your code: you look for the first time ce_charttime > ne_charttime and then take the previous row. If there's no such time, you'll never have the chance to take the previous row, hence the NaNs in your result table starting from row 8.

PPS: This includes ce_charttime in the final dataframe. You can replace it by a column of how old the information is and/or remove it:

final_df['info_age'] = final_df.ne_charttime - final_df.ce_charttime
final_df = final_df.drop(columns='ce_charttime')

UPDATE for EDIT-2: As I wrote at the very beginning, repeated in the comments, and as the docs clearly state: both ce_charttime and ne_charttime must be sorted (hadm_id need not be sorted). If this condition is not met, you'll have to (temporarily) sort your dataframes as required. See the following example:

import string

import numpy as np
import pandas as pd

# pd.np is deprecated/removed in recent pandas, so use numpy directly
df_str = pd.DataFrame({'hadm_id': np.tile([111111, 222222], 10),
                       'ce_charttime': pd.date_range('2019-10-01 00:30', periods=20, freq='30min'),
                       'hr': np.random.randint(80, 120, 20)})
df_notes = pd.DataFrame({'hadm_id': np.tile([111111, 222222], 3),
                         'ne_charttime': pd.date_range('2019-10-01 00:45', periods=6, freq='40min'),
                         'note': [''.join(np.random.choice(list(string.ascii_letters), 10)) for _ in range(6)]}).sort_values('hadm_id')

# sort the notes by time for the merge, then restore the per-encounter ordering
final_df = pd.merge_asof(df_notes.sort_values('ne_charttime'), df_str,
                         left_on='ne_charttime', right_on='ce_charttime',
                         by='hadm_id').sort_values(['hadm_id', 'ne_charttime'])

print(df_str); print(df_notes); print(final_df)

Output:

    hadm_id        ce_charttime   hr
0    111111 2019-10-01 00:30:00  118
1    222222 2019-10-01 01:00:00   93
2    111111 2019-10-01 01:30:00   92
3    222222 2019-10-01 02:00:00   86
4    111111 2019-10-01 02:30:00   88
5    222222 2019-10-01 03:00:00   86
6    111111 2019-10-01 03:30:00  106
7    222222 2019-10-01 04:00:00   91
8    111111 2019-10-01 04:30:00  109
9    222222 2019-10-01 05:00:00   95
10   111111 2019-10-01 05:30:00  113
11   222222 2019-10-01 06:00:00   92
12   111111 2019-10-01 06:30:00  104
13   222222 2019-10-01 07:00:00   83
14   111111 2019-10-01 07:30:00  114
15   222222 2019-10-01 08:00:00   98
16   111111 2019-10-01 08:30:00  110
17   222222 2019-10-01 09:00:00   89
18   111111 2019-10-01 09:30:00   98
19   222222 2019-10-01 10:00:00  109
   hadm_id        ne_charttime        note
0   111111 2019-10-01 00:45:00  jOcRWVdPDF
2   111111 2019-10-01 02:05:00  mvScJNrwra
4   111111 2019-10-01 03:25:00  FBAFbJYflE
1   222222 2019-10-01 01:25:00  ilNuInOsYZ
3   222222 2019-10-01 02:45:00  ysyolaNmkV
5   222222 2019-10-01 04:05:00  wvowGGETaP
   hadm_id        ne_charttime        note        ce_charttime   hr
0   111111 2019-10-01 00:45:00  jOcRWVdPDF 2019-10-01 00:30:00  118
2   111111 2019-10-01 02:05:00  mvScJNrwra 2019-10-01 01:30:00   92
4   111111 2019-10-01 03:25:00  FBAFbJYflE 2019-10-01 02:30:00   88
1   222222 2019-10-01 01:25:00  ilNuInOsYZ 2019-10-01 01:00:00   93
3   222222 2019-10-01 02:45:00  ysyolaNmkV 2019-10-01 02:00:00   86
5   222222 2019-10-01 04:05:00  wvowGGETaP 2019-10-01 04:00:00   91
