Fastest way to reshape variable values as columns
Question
I have a dataset with about 3 million rows and the following structure:
PatientID| Year | PrimaryConditionGroup
---------------------------------------
1 | Y1 | TRAUMA
1 | Y1 | PREGNANCY
2 | Y2 | SEIZURE
3 | Y1 | TRAUMA
Being fairly new to R, I have some trouble finding the right way to reshape the data into the structure outlined below:
PatientID| Year | TRAUMA | PREGNANCY | SEIZURE
----------------------------------------------
1 | Y1 | 1 | 1 | 0
2 | Y2 | 0 | 0 | 1
3 | Y1 | 1 | 0 | 0
My question is: what is the fastest/most elegant way to create a data.frame in which the values of PrimaryConditionGroup become columns, grouped by PatientID and Year and counting the number of occurrences?
Recommended answer
There are probably more succinct ways of doing this, but for sheer speed, it's hard to beat a data.table-based solution:
# Sample data matching the structure in the question
df <- read.table(text="PatientID Year PrimaryConditionGroup
1 Y1 TRAUMA
1 Y1 PREGNANCY
2 Y2 SEIZURE
3 Y1 TRAUMA", header=TRUE)
library(data.table)
# Key the table on the grouping columns
dt <- data.table(df, key=c("PatientID", "Year"))
# Count the occurrences of each condition within each PatientID/Year group
dt[ , list(TRAUMA    = sum(PrimaryConditionGroup=="TRAUMA"),
           PREGNANCY = sum(PrimaryConditionGroup=="PREGNANCY"),
           SEIZURE   = sum(PrimaryConditionGroup=="SEIZURE")),
    by = list(PatientID, Year)]
#      PatientID Year TRAUMA PREGNANCY SEIZURE
# [1,]         1   Y1      1         1       0
# [2,]         2   Y2      0         0       1
# [3,]         3   Y1      1         0       0
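The call above names each condition by hand, which gets tedious when PrimaryConditionGroup has many distinct values. As a hedged sketch rather than part of the original answer, and assuming a data.table version that provides dcast(), the same counts can be produced without listing the conditions. The name wide is only illustrative.

```r
library(data.table)

# Same sample data as in the question
df <- data.frame(
  PatientID = c(1, 1, 2, 3),
  Year = c("Y1", "Y1", "Y2", "Y1"),
  PrimaryConditionGroup = c("TRAUMA", "PREGNANCY", "SEIZURE", "TRAUMA")
)
dt <- as.data.table(df)

# One row per PatientID/Year, one column per distinct condition;
# length() counts the rows falling into each cell, and empty cells become 0
wide <- dcast(dt, PatientID + Year ~ PrimaryConditionGroup,
              value.var = "PrimaryConditionGroup",
              fun.aggregate = length)
print(wide)
```

The condition columns come out in sorted order (PREGNANCY, SEIZURE, TRAUMA) rather than in the order used in the question.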
aggregate() provides a 'base R' solution that might or might not be more idiomatic. (The sole complication is that aggregate() returns a data.frame whose third column is a matrix of counts, rather than one column per condition; the last line below fixes that up.)
# table() needs a factor so that every group reports the same set of levels
df$PrimaryConditionGroup <- factor(df$PrimaryConditionGroup)
out <- aggregate(PrimaryConditionGroup ~ PatientID + Year, data=df, FUN=table)
out <- cbind(out[1:2], data.frame(out[3][[1]]))
2nd EDIT: Finally, a succinct solution using the reshape package gets you to the same place.
library(reshape)
# Melt to long form, then cast the condition values into columns, counting rows
mdf <- melt(df, id=c("PatientID", "Year"))
cast(PatientID + Year ~ value, data=mdf, fun.aggregate=length)