Pandas - How to pick events by date and create a new ordered dataframe - surgery patients
I am a surgeon looking at neurosurgery. I have a dataframe of 600,000 records and 70 columns, with about 7 date columns for various events that happened to patients in a hospital system over a 6-year period. I am interested in cranial implants to repair the skull.
There are 4000 records in this dataframe that show a code for an operation to either insert or remove an implant. About 900 patients had more than one operation, with about 500 insertions and 500 or so removals of implants (for infection etc.). I have the dates of the operations as pd.datetime. I have the patients' encrypted IDs.
The 600,000 records span a 6-year period. I need to analyse the 900 who had multiple operations. I need to order the operations by date because this is just a snapshot in time. E.g. a patient could have had an implant put in before the data collection for the snapshot started, then had it removed during the snapshot, then reinserted during the snapshot. Conversely, one could have had the reverse: insertion and removal during the snapshot. So I want to establish the numbers of removal-insertion and insertion-removal sequences, and the time between.
Ideally I'd like a table with patient ID as the index and insertion and removal dates as fields. I can then calculate the time between.
I am new to Python; I can do basic filtering, groupby, crosstab etc. but not loops yet. Many thanks.
ID OP_code OPDATE_01
1 xxx V259 2014-12-12
2 xxx A082 2014-06-23
3 999 V011 2014-08-07
4 xxx A023 2014-09-12
... ... ... ...
473231 xxx A651 2018-10-03
473233 999 V014 2018-07-06
473235 xxx A263 2018-05-18
Here's some data. The rows are individual episodes of care, so the patient ID column is not unique. Above, patient ID 999 had an implant put in (code V011) on 2014-08-07, and then had it taken out (code V014) on 2018-07-06. So what I'd like is a table of
ID   OPDATE1     OP_01_code  OPDATE2     OP_02_code
999  2014-08-07  V011        2018-07-06  V014
To do this I would have to search the 3000 or so individual patient IDs in the 4000-record dataframe to get the operations for each patient, then order them as in the table above. Obviously the majority would have had only one operation.
Update - after @Arne's suggestion below.
display(df_implants)
OPDATE_01 OPERTN_01
ENCRYPTED_HESID
1111 [2019-01-26] [V011]
1112 [2019-01-22] [V011]
1113 [2015-09-24] [V011]
1114 [2016-06-21, 2017-02-27] [V011, V014]
1115 [2018-12-27] [V011]
... ... ...
3046 [2017-02-18] [V011]
3047 [2013-06-08] [V011]
Edit: I've changed the filter criterion below to at least two different OPs of interest.
Here is one way to do this. I've changed your data somewhat for testing purposes.
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 999, 3, 1, 999, 2],
                   'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'],
                   'OP_date': ['2014-12-12', '2014-06-23', '2014-08-07', '2014-09-12',
                               '2018-10-03', '2018-07-06', '2018-05-18']})
df.set_index('ID', inplace=True)
display(df)
OP_code OP_date
ID
1 V011 2014-12-12
2 A082 2014-06-23
999 V011 2014-08-07
3 V011 2014-09-12
1 A651 2018-10-03
999 V014 2018-07-06
2 A263 2018-05-18
First we should transform the data so that there is exactly one row per patient, collecting the data from multiple OPs in lists:
df_patients = pd.pivot_table(df, index=df.index, aggfunc=list)
display(df_patients)
OP_code OP_date
ID
1 [V011, A651] [2014-12-12, 2018-10-03]
2 [A082, A263] [2014-06-23, 2018-05-18]
3 [V011] [2014-09-12]
999 [V011, V014] [2014-08-07, 2018-07-06]
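The lists above follow the row order of the input, which happens to be chronological in this toy data. A minimal sketch, assuming the real dates may arrive unsorted and as strings: convert OP_date to datetimes, sort, and only then collect the lists. It uses groupby(...).agg(list) in place of pivot_table; that is a swapped-in technique, not part of the original answer, and the column names are those of the test data above.

```python
import pandas as pd

# Same toy data as above; column names follow the answer's example.
df = pd.DataFrame({
    'ID': [1, 2, 999, 3, 1, 999, 2],
    'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'],
    'OP_date': ['2014-12-12', '2014-06-23', '2014-08-07', '2014-09-12',
                '2018-10-03', '2018-07-06', '2018-05-18'],
})

# Parse the strings as real timestamps so that sorting is chronological.
df['OP_date'] = pd.to_datetime(df['OP_date'])

# Sort first, then collect: groupby keeps the row order within each group,
# so every list comes out in date order.
df_patients = (
    df.sort_values(['ID', 'OP_date'])
      .groupby('ID')[['OP_code', 'OP_date']]
      .agg(list)
)
print(df_patients)
```

With real timestamps in the lists, the gap between two operations of one patient can be taken by plain subtraction of the list elements.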
Now given a list of the OP codes that correspond to the implants you're interested in, we can loop through the rows of this DataFrame to create an index of only those patients that had at least two different OPs of interest. Then we can filter the data according to this new index.
implant_codes = {'V011', 'V014'}
implant_index = []
for i in df_patients.index:
    # EDIT: filter criterion tightened to at least two different
    # relevant OPs, i.e. the intersection of the implant_codes
    # set with the patient's OP list has at least two elements.
    if len(implant_codes.intersection(df_patients.OP_code[i])) >= 2:
        implant_index.append(i)
df_implants = df_patients.filter(items=implant_index, axis=0)
display(df_implants)
OP_code OP_date
ID
999 [V011, V014] [2014-08-07, 2018-07-06]
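Since the asker mentions not being comfortable with loops yet, the same filter can probably be written without one. This is a sketch under the same assumptions as the loop (an OP_code column holding lists, and implant_codes as a set); the small df_patients built here is a stand-in for the one above. Like the loop, it counts distinct codes, so a patient with two insertions and no removal would not be selected.

```python
import pandas as pd

# Stand-in for the per-patient frame built above (OP_code column only).
df_patients = pd.DataFrame(
    {'OP_code': [['V011', 'A651'], ['A082', 'A263'], ['V011'], ['V011', 'V014']]},
    index=pd.Index([1, 2, 3, 999], name='ID'),
)
implant_codes = {'V011', 'V014'}

# Per patient, count how many distinct implant codes appear in the list.
n_distinct = df_patients['OP_code'].apply(
    lambda codes: len(implant_codes.intersection(codes))
)

# Boolean mask: keep patients with at least two different implant codes.
df_implants = df_patients[n_distinct >= 2]
print(df_implants)
```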
You can access data elements here by combining the indexing syntax for DataFrames and lists. For example, df_implants.loc[999, 'OP_date'][0]
yields the first OP date of patient 999: '2014-08-07'
I would not recommend creating a separate column for each OP. You could try something like this:
df_implants[['OP_date_1', 'OP_date_2']] = pd.DataFrame(df_implants.OP_date.values.tolist(),
                                                         index=df_implants.index)
display(df_implants)
OP_code OP_date OP_date_1 OP_date_2
ID
999 [V011, V014] [2014-08-07, 2018-07-06] 2014-08-07 2018-07-06
However, this approach will run into trouble in practice, because the number of OPs varies across patients. That's why I think the list representation given above is more natural and easier to handle.
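For the counts and intervals the question asks about, another option that copes with a varying number of OPs is to keep one row per operation and pair each row with the same patient's previous implant operation. The following is only a sketch, not the answer's method: it assumes V011 means insertion and V014 means removal (as the question describes), and the names ops, pairs, prev_code, prev_date, sequence and days_between are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 999, 3, 1, 999, 2],
    'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'],
    'OP_date': pd.to_datetime(['2014-12-12', '2014-06-23', '2014-08-07',
                               '2014-09-12', '2018-10-03', '2018-07-06',
                               '2018-05-18']),
})

# Assumption from the question: V011 = insertion, V014 = removal.
implant_codes = {'V011', 'V014'}

# Keep implant operations only, in date order within each patient.
ops = (df[df['OP_code'].isin(implant_codes)]
       .sort_values(['ID', 'OP_date'])
       .copy())

# Pair each operation with the same patient's previous implant operation.
ops['prev_code'] = ops.groupby('ID')['OP_code'].shift()
ops['prev_date'] = ops.groupby('ID')['OP_date'].shift()

# A patient's first operation has no predecessor; drop those rows.
pairs = ops.dropna(subset=['prev_code']).copy()
pairs['days_between'] = (pairs['OP_date'] - pairs['prev_date']).dt.days
pairs['sequence'] = pairs['prev_code'] + '->' + pairs['OP_code']

print(pairs[['ID', 'sequence', 'days_between']])
print(pairs['sequence'].value_counts())
```

Each row of pairs is one consecutive pair of implant operations, so value_counts() on the sequence column gives the number of insertion-removal and removal-insertion pairs, and days_between holds the interval for each.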