将变量值重塑为列的最快​​方法 [英] Fastest way to reshape variable values as columns

查看:44
本文介绍了将变量值重塑为列的最快​​方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含大约 300 万行和以下结构的数据集:

I have a dataset with about 3 million rows and the following structure:

PatientID| Year | PrimaryConditionGroup
---------------------------------------
1        | Y1   | TRAUMA
1        | Y1   | PREGNANCY
2        | Y2   | SEIZURE
3        | Y1   | TRAUMA

作为 R 的新手,我很难找到正确的方法将数据重塑为下面概述的结构:

Being fairly new to R, I have some trouble finding the right way to reshape the data into the structure outlined below:

PatientID| Year | TRAUMA | PREGNANCY | SEIZURE
----------------------------------------------
1        | Y1   | 1      | 1         | 0
2        | Y2   | 0      | 0         | 1
3        | Y1   | 1      | 0         | 1

我的问题是:创建 data.frame 的最快/最优雅的方法是什么,其中 PrimaryConditionGroup 的值成为列,按 PatientID 和 Year(计算出现次数)分组?

My question is: What is the fastest/most elegant way to create a data.frame, where the values of PrimaryConditionGroup become columns, grouped by PatientID and Year (counting the number of occurences)?

推荐答案

可能有更简洁的方法来做到这一点,但就纯粹的速度而言,很难击败基于 data.table 的解决方案:

There are probably more succinct ways of doing this, but for sheer speed, it's hard to beat a data.table-based solution:

df <- read.table(text="PatientID Year  PrimaryConditionGroup
1         Y1    TRAUMA
1         Y1    PREGNANCY
2         Y2    SEIZURE
3         Y1    TRAUMA", header=T)

library(data.table)
dt <- data.table(df, key=c("PatientID", "Year"))

dt[ , list(TRAUMA =    sum(PrimaryConditionGroup=="TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
           SEIZURE =   sum(PrimaryConditionGroup=="SEIZURE")),
   by = list(PatientID, Year)]

#      PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,]         1   Y1      1         1       0
# [2,]         2   Y2      0         0       1
# [3,]         3   Y1      1         0       0

aggregate() 提供了一个基本 R"解决方案,它可能更惯用也可能不是.(唯一的复杂之处是聚合返回一个矩阵,而不是一个 data.frame;下面的第二行解决了这个问题.)

aggregate() provides a 'base R' solution that might or might not be more idiomatic. (The sole complication is that aggregate returns a matrix, rather than a data.frame; the second line below fixes that up.)

out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table)
out <- cbind(out[1:2], data.frame(out[3][[1]]))

第二次编辑 最后,使用 reshape 包的简洁解决方案可以让您达到同样的目的.

2nd EDIT Finally, a succinct solution using the reshape package gets you to the same place.

library(reshape)
mdf <- melt(df, id=c("PatientID", "Year"))
cast(PatientID + Year ~ value, data=j, fun.aggregate=length)

这篇关于将变量值重塑为列的最快​​方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆