How do you summarize columns based on unique IDs without knowing IDs in R?
Problem description
I've been going through the posts on summarizing data, but haven't found what I'm looking for.
I wish to create a summary "count table" that lets me see how often a certain medication was given to patients. The fact that some patients received multiple medications simultaneously doesn't matter, because I simply want a summary of all the medication given, and then to calculate what percentage each medication class makes up of all medication given. The issue is that I don't know the names of the possible medications in advance; they're "hidden" somewhere in the data.frame. So I first have to specify which columns R should look through to create a "list" by which it can then summarize the columns.

I anticipate that this points towards the plyr package, but my attempts to use its functions correctly haven't worked so far.

My df looks something like this:

    x <- sample(letters[1:4], 20, replace = TRUE)
    y <- sample(letters[1:5], 20, replace = TRUE)
    z <- sample(letters[1:6], 20, replace = TRUE)
    df <- data.frame(x, y, z)
    head(df)
      x y z
    1 a a f
    2 a c d
    3 b b e
    4 c d b
    5 a a b
    6 c d d

As you can see, the data.frame contains three columns which have the same but also different letters, indicating the name of the medication given.

What I'd now like to do is create a list of unique characters,

    unique(x)
    unique(y)
    unique(z)

which serves as my reference list by which R can then summarize the counts in each column.
summary(df) returns a summary of counts of each column, but not of each ID itself, and also without a percentage of all unique counts.
I also tried the following, which sort of goes in the right direction, but ideally I'd like to have a list of unique characters that I can feed to the length argument:

    ddply(df, .(x), summarize, counts = length(unique(y)))

Any idea how I could do this? Help much appreciated.
Solution
If you just want a count for the whole data frame, you can use table(unlist(df)) (see also @goctlr's answer), and if you also want proportions: prop.table(table(unlist(df))). When you also want the count for the individual columns, it gets more difficult.

To get the count for each column and the total count, I wrote the following function:

    # some reproducible data:
    set.seed(1)
    x <- sample(letters[1:4], 20, replace = TRUE)
    y <- sample(letters[1:5], 20, replace = TRUE)
    z <- sample(letters[1:6], 20, replace = TRUE)
    df <- data.frame(x, y, z)

    # the function
    func <- function(x) {
      x2 <- data.frame()
      nms <- names(x)
      # every distinct value that occurs anywhere in the data frame
      id <- sort(unique(unlist(x)))
      for (i in 1:length(id)) {
        for (j in 1:length(nms)) {
          # how often value i occurs in column j
          x2[i, j] <- sum(x[, j] %in% id[i])
        }
      }
      names(x2) <- nms
      x2$total <- rowSums(x2)
      x2 <- cbind(id, x2)
      # store the result as "dat" in the global environment
      assign("dat", x2, envir = .GlobalEnv)
    }

Executing the function with func(df) will give you a data frame dat in your global environment:

    > dat
      id x y z total
    1  a 4 4 3    11
    2  b 5 5 2    12
    3  c 5 4 4    13
    4  d 6 4 5    15
    5  e 0 3 5     8
    6  f 0 0 1     1

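The nested loops and the assignment into the global environment are not essential to this approach. As a minimal sketch of the same counting idea, the variant below tabulates every column against one shared set of levels and returns the result instead. The function name count_by_column and the small example data are hypothetical and are not part of the original answer.

```r
# Hypothetical helper: count how often each value occurs in every column.
count_by_column <- function(data) {
  # All distinct values found anywhere in the data frame, sorted.
  ids <- sort(unique(unlist(lapply(data, as.character))))
  # Tabulate each column against the same levels, so a value that is
  # missing from a column is reported as 0 rather than dropped.
  counts <- sapply(data, function(col) {
    as.vector(table(factor(as.character(col), levels = ids)))
  })
  # Force a matrix even when there is only one distinct value.
  counts <- matrix(counts, nrow = length(ids),
                   dimnames = list(NULL, names(data)))
  out <- data.frame(id = ids, counts, stringsAsFactors = FALSE)
  out$total <- rowSums(counts)
  out
}

# Small fixed example (assumed data, chosen so the counts are easy to check).
example_df <- data.frame(
  x = c("a", "b", "a"),
  y = c("b", "c", "c"),
  z = c("a", "a", "d")
)
res <- count_by_column(example_df)
print(res)
```

Returning the data frame leaves it to the caller to choose a name, for example dat <- count_by_column(df), which should give the same layout as the dat shown above.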
After that, you can calculate the percentages with, for example, the dplyr package:

    library(dplyr)
    dat <- dat %>%
      mutate(xperc = round(100 * x / sum(total), 1),
             yperc = round(100 * y / sum(total), 1),
             zperc = round(100 * z / sum(total), 1),
             perc  = round(100 * total / sum(total), 1))

which results in:

    > dat
      id x y z total xperc yperc zperc perc
    1  a 4 4 3    11   6.7   6.7   5.0 18.3
    2  b 5 5 2    12   8.3   8.3   3.3 20.0
    3  c 5 4 4    13   8.3   6.7   6.7 21.7
    4  d 6 4 5    15  10.0   6.7   8.3 25.0
    5  e 0 3 5     8   0.0   5.0   8.3 13.3
    6  f 0 0 1     1   0.0   0.0   1.7  1.7
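The mutate() call above names each column by hand, which only works when the column names are known up front. As a hedged sketch in base R, the percentages can also be computed for all columns at once, together with the overall share given by prop.table(table(unlist(...))). The data frame meds and the variable names below are assumptions made for illustration only.

```r
# Small fixed example (hypothetical data, not the data from the answer).
meds <- data.frame(
  x = c("a", "b", "a"),
  y = c("b", "c", "c"),
  z = c("a", "a", "d")
)

# Overall share of each value across all columns, in percent.
all_values <- unlist(lapply(meds, as.character))
overall <- round(100 * prop.table(table(all_values)), 1)
print(overall)

# Per-column counts against the same set of values (rows = values).
counts <- sapply(meds, function(col) {
  table(factor(as.character(col), levels = names(overall)))
})

# Each cell as a percentage of the grand total, for every column at once.
perc <- round(100 * counts / sum(counts), 1)
print(perc)
```

Because each cell is divided by the grand total, the row sums of perc match the overall shares up to rounding, which mirrors the relationship between xperc, yperc, zperc and perc in the output above.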