Pandas - How to pick events by date and create a new ordered dataframe - surgery patients

Problem description

I am a surgeon looking at neurosurgery. I have a dataframe of 600,000 records and 70 columns, with about 7 date columns for various events that happened to patients in a hospital system over a 6-year period. I am interested in cranial implants to repair the skull.

There are 4000 records in this dataframe that show a code for an operation to either insert or remove an implant. About 900 patients had more than one operation, with about 500 insertions and 500 or so removals of implants (for infection etc.). I have the dates of the operations as pandas datetimes. I have the patients' encrypted IDs.

The 600,000 records span a 6-year period. I need to analyse the 900 patients who had multiple operations. I need to order the operations by date because this is just a snapshot in time. For example, a patient could have had an implant put in before the data collection of the snapshot started, then had it removed during the snapshot, then reinserted during the snapshot. Conversely, one could have had the reverse: insertion and removal during the snapshot. So I want to establish the numbers of removal-insertion and insertion-removal sequences, and the time between the two operations.

Ideally I'd like a table with patient ID as the index and the insertion and removal dates as fields. I can then calculate the time between them.

I am new to Python. I can do basic filtering, groupby, crosstab etc., but not loops yet. Many thanks.

        ID  OP_code  OPDATE_01

1       xxx V259    2014-12-12
2       xxx A082    2014-06-23
3       999 V011    2014-08-07
4       xxx A023    2014-09-12
... ... ... ...
473231  xxx A651    2018-10-03
473233  999 V014    2018-07-06
473235  xxx A263    2018-05-18

Here's some data. The rows are individual episodes of care, so the patient ID column is not unique. Above, patient ID 999 had an implant put in (code V011) on 2014-08-07, and then had it taken out (code V014) on 2018-07-06. So what I'd like is a table like this:

ID    OPDATE1     OP_01_code  OPDATE2     OP_02_code
999   2014-08-07  V011        2018-07-06  V014

To do this I would have to look up each of the 3000 or so individual patient IDs in the 4000-record dataframe to get that patient's operations, then order them as in the table above. Obviously the majority would have had only one operation.

Update, after @Arne's suggestion below:


display(df_implants)
                                    OPDATE_01                    OPERTN_01
ENCRYPTED_HESID     
1111                                [2019-01-26]                 [V011]
1112                                [2019-01-22]                 [V011]
1113                                [2015-09-24]                 [V011]
1114                                [2016-06-21, 2017-02-27]     [V011, V014]
1115                                [2018-12-27]                 [V011]
... ... ...
3046                                [2017-02-18]                 [V011]
3047                                [2013-06-08]                 [V011]

Solution

Edit: I've changed the filter criterion below to at least two different OPs of interest.

Here is one way to do this. I've changed your data somewhat for testing purposes.

import pandas as pd

# Small test frame: one row per operation (OP), several rows per patient.
df = pd.DataFrame({'ID': [1, 2, 999, 3, 1, 999, 2],
                   'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'],
                   'OP_date': ['2014-12-12', '2014-06-23', '2014-08-07', '2014-09-12',
                               '2018-10-03', '2018-07-06', '2018-05-18']})
df.set_index('ID', inplace=True)
display(df)  # display() is available in Jupyter/IPython; use print(df) elsewhere

   OP_code     OP_date
ID      
1    V011   2014-12-12
2    A082   2014-06-23
999  V011   2014-08-07
3    V011   2014-09-12
1    A651   2018-10-03
999  V014   2018-07-06
2    A263   2018-05-18

First we should transform the data so that there is exactly one row per patient, collecting the data from multiple OPs in lists:

df_patients = pd.pivot_table(df, index=df.index, aggfunc=list)
display(df_patients)

     OP_code        OP_date
ID      
1    [V011, A651]   [2014-12-12, 2018-10-03]
2    [A082, A263]   [2014-06-23, 2018-05-18]
3    [V011]         [2014-09-12]
999  [V011, V014]   [2014-08-07, 2018-07-06]
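One thing to watch: the lists keep the row order of the original frame, so they are only chronological if the rows already were. A minimal sketch of sorting first, assuming the same toy columns as above (patient 999's rows are deliberately swapped here, and `df_sorted` and `df_patients_sorted` are names made up for this example). It uses `groupby(...).agg(list)` rather than `pivot_table`, which should give the same kind of result:

```python
import pandas as pd

# Same toy data, but patient 999's removal (V014) appears before the insertion (V011).
df = pd.DataFrame({'ID': [1, 2, 999, 3, 1, 999, 2],
                   'OP_code': ['V011', 'A082', 'V014', 'V011', 'A651', 'V011', 'A263'],
                   'OP_date': ['2014-12-12', '2014-06-23', '2018-07-06', '2014-09-12',
                               '2018-10-03', '2014-08-07', '2018-05-18']})

# Parse the strings into real timestamps so that sorting and subtraction work.
df['OP_date'] = pd.to_datetime(df['OP_date'])

# Sort by patient and date, then collect each patient's OPs into lists.
df_sorted = df.sort_values(['ID', 'OP_date'])
df_patients_sorted = df_sorted.groupby('ID').agg(list)

print(df_patients_sorted)
```

Because the rows are sorted before grouping, the first element of each list is that patient's earliest OP.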

Now, given the set of OP codes that correspond to the implants you're interested in, we can loop through the rows of this DataFrame to build an index of only those patients who had at least two different OPs of interest. Then we can filter the data according to this new index.

implant_codes = {'V011', 'V014'}

implant_index = []
for i in df_patients.index:
    # EDIT: filter criterion tightened to at least two different
    # relevant OPs, i.e. the intersection of the implant_codes
    # set with the patient's OP list has at least two elements.
    if len(implant_codes.intersection(df_patients.OP_code[i])) >= 2:
        implant_index.append(i)

# Keep only the rows whose index label is in implant_index.
df_implants = df_patients.filter(implant_index, axis=0)
display(df_implants)

     OP_code       OP_date
ID      
999  [V011, V014]  [2014-08-07, 2018-07-06]
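If you would rather avoid writing the loop, the same criterion can probably be expressed as a boolean mask. This is only a sketch under the assumption that `df_patients` looks like the table above; the frame is rebuilt by hand here so the snippet runs on its own, and `has_two_codes` is a name invented for the example:

```python
import pandas as pd

# Hand-built copy of the per-patient table shown above (OP codes only).
df_patients = pd.DataFrame(
    {'OP_code': [['V011', 'A651'], ['A082', 'A263'], ['V011'], ['V011', 'V014']]},
    index=pd.Index([1, 2, 3, 999], name='ID'))

implant_codes = {'V011', 'V014'}

# True for patients whose OP list contains at least two different implant codes.
has_two_codes = df_patients['OP_code'].apply(
    lambda codes: len(implant_codes.intersection(codes)) >= 2)

# Boolean indexing keeps only the rows where the mask is True.
df_implants = df_patients[has_two_codes]
print(df_implants)
```

The `apply` call still visits every row, so this is about readability rather than speed.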

You can access data elements here by a combination of the indexing syntax for DataFrames and lists, e.g. df_implants.loc[999, 'OP_date'][0] yields the first OP date of patient 999: '2014-08-07'
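Since the dates in this test frame are plain strings, they need converting before you can subtract them. A small sketch of getting the time between patient 999's two OPs, assuming the list representation above (the frame is rebuilt by hand, and `dates` and `gap` are illustrative names):

```python
import pandas as pd

# Hand-built copy of the filtered table shown above.
df_implants = pd.DataFrame(
    {'OP_code': [['V011', 'V014']],
     'OP_date': [['2014-08-07', '2018-07-06']]},
    index=pd.Index([999], name='ID'))

# Convert the list of date strings into a DatetimeIndex.
dates = pd.to_datetime(df_implants.loc[999, 'OP_date'])

# Subtracting two timestamps gives a Timedelta; .days is the whole number of days.
gap = dates[1] - dates[0]
print(gap.days)
```

If your real date column already holds pandas datetimes, the `pd.to_datetime` step should be unnecessary.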

I would not recommend creating a separate column for each OP, but if you want to, you could try something like this:

df_implants[['OP_date_1', 'OP_date_2']] = pd.DataFrame(df_implants.OP_date.values.tolist(), 
                                                       index=df_implants.index)
display(df_implants)

     OP_code       OP_date                   OP_date_1   OP_date_2
ID              
999  [V011, V014]  [2014-08-07, 2018-07-06]  2014-08-07  2018-07-06

However, this approach will run into trouble in practice, due to the fact that the number of OPs varies across patients. That's why I think the list representation given above is more natural and easier to handle.
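One way to sidestep the varying number of OPs altogether is to stay in the long format (one row per OP) and pair each implant OP with the same patient's previous one. This is a hedged sketch rather than part of the approach above: it assumes, as described in the question, that V011 is an insertion and V014 a removal, and the names `ops`, `pairs`, `sequence` and `days_between` are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 999, 3, 1, 999, 2],
                   'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'],
                   'OP_date': pd.to_datetime(['2014-12-12', '2014-06-23', '2014-08-07',
                                              '2014-09-12', '2018-10-03', '2018-07-06',
                                              '2018-05-18'])})

implant_codes = ['V011', 'V014']  # assumed: V011 = insertion, V014 = removal

# Keep only implant OPs, ordered by patient and date.
ops = df[df['OP_code'].isin(implant_codes)].sort_values(['ID', 'OP_date'])

# For each OP, look up the same patient's previous implant OP (NaN/NaT if none).
ops['prev_code'] = ops.groupby('ID')['OP_code'].shift()
ops['prev_date'] = ops.groupby('ID')['OP_date'].shift()

# Rows that have a previous OP describe one consecutive pair of operations.
pairs = ops.dropna(subset=['prev_code']).copy()
pairs['sequence'] = pairs['prev_code'] + '->' + pairs['OP_code']
pairs['days_between'] = (pairs['OP_date'] - pairs['prev_date']).dt.days

# Number of insertion->removal and removal->insertion pairs.
counts = pairs['sequence'].value_counts()
print(pairs[['ID', 'sequence', 'days_between']])
print(counts)
```

A patient with three implant OPs simply contributes two rows to `pairs`, so no extra columns are needed.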
