How do you summarize columns based on unique IDs without knowing IDs in R?


Question


I've been going through the posts about summarizing data, but haven't found what I'm looking for.

I wish to create a summary "count-table" which will allow me to see how often a certain medication was given to patients. The fact that some patients received multiple medications simultaneously doesn't matter, because I simply want a summary of all the medication given and then to calculate what percentage each medication class is of all medication given. The issue is that I don't know the names of the possible medications; they're "hidden" somewhere in the data.frame. Thus, I have to specify which columns R should look through first to create a "list" by which it can then summarize the columns.

I anticipate that this points towards the plyr package, but my attempts to use its functions correctly haven't worked so far.

My df looks something like this

x <- sample(letters[1:4], 20, replace = TRUE)
y <- sample(letters[1:5], 20, replace = TRUE)
z <- sample(letters[1:6], 20, replace = TRUE)
df <- data.frame(x, y, z)
head(df)
  x y z
1 a a f
2 a c d
3 b b e
4 c d b
5 a a b
6 c d d

as you can see, the data.frame contains three columns which have the same but also different letters, indicating the name of the medication given.

What I'd now like to do is create a list of unique characters,

unique(x)
unique(y)
unique(z)

which serves as my reference list by which R can then summarize the counts in each column.

summary(df)

returns a summary of counts of each column but not of each ID itself and also without a percentage of all unique counts.

I also tried the following, which sort of goes in the right direction, but ideally, I'd like to have a list of unique characters, which I can feed to the length argument

ddply(df, .(x), summarize, counts=length(unique(y)))

Any idea how I could do this? Help much appreciated.

Solution

If you just want a count for the whole dataframe, you can use table(unlist(df)) (see also @goctlr's answer), and if you also want proportions: prop.table(table(unlist(df))). Getting the counts for the individual columns as well is more difficult.
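To make the two one-liners concrete, here is a minimal sketch on a small hand-made data frame. The name meds and its contents are made up for illustration and are not the sampled data from the question.

```r
# Hypothetical example data: three medication columns, three patients
meds <- data.frame(
  x = c("a", "a", "b"),
  y = c("a", "c", "b"),
  z = c("f", "d", "b"),
  stringsAsFactors = FALSE
)

# unlist() flattens all columns into one vector,
# so table() counts every value regardless of its column
counts <- table(unlist(meds))

# prop.table() turns the counts into proportions that sum to 1
shares <- prop.table(counts)

print(counts)
print(round(100 * shares, 1))  # the same shares as percentages
```

No list of IDs has to be supplied up front: table() discovers the distinct values itself.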

To get the count for each column and the total count, I wrote the following function:

# some reproducible data:
set.seed(1)
x <- sample(letters[1:4], 20, replace = TRUE)
y <- sample(letters[1:5], 20, replace = TRUE)
z <- sample(letters[1:6], 20, replace = TRUE)
df <- data.frame(x,y,z)

# the function
func <- function(x) {
  x2 <- data.frame()
  nms <- names(x)
  id <- sort(unique(unlist(x)))
  for(i in 1:length(id)) {
    for(j in 1:length(nms)) {
      x2[i,j] <- sum(x[,j] %in% id[i])
    }
  }
  names(x2) <- nms
  x2$total <- rowSums(x2)
  x2 <- cbind(id,x2)
  assign("dat", x2, envir = .GlobalEnv)
}

Executing the function with func(df) will give you a dataframe dat in your global environment:

> dat
  id x y z total
1  a 4 4 3    11
2  b 5 5 2    12
3  c 5 4 4    13
4  d 6 4 5    15
5  e 0 3 5     8
6  f 0 0 1     1
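As a possible alternative to the nested loops, the same table can be built with sapply() and table(), returning the result instead of assigning it into the global environment. This is only a sketch: the name count_by_column is made up here, and the example data are the first four rows printed in the question.

```r
# Hypothetical helper: count each distinct value per column and overall
count_by_column <- function(data) {
  # Reference list of every distinct value found anywhere in the data
  ids <- sort(unique(as.character(unlist(data))))
  # Count each id in each column; fixed factor levels keep zero counts
  per_col <- sapply(data, function(col) {
    as.vector(table(factor(as.character(col), levels = ids)))
  })
  # sapply() returns a plain vector when there is only one id,
  # so force a matrix with one row per id
  per_col <- matrix(per_col, nrow = length(ids),
                    dimnames = list(NULL, names(data)))
  out <- data.frame(id = ids, per_col, stringsAsFactors = FALSE)
  out$total <- rowSums(per_col)
  out
}

# First four rows shown in the question
meds <- data.frame(
  x = c("a", "a", "b", "c"),
  y = c("a", "c", "b", "d"),
  z = c("f", "d", "e", "b")
)
dat <- count_by_column(meds)
print(dat)
```

Returning the data frame lets the caller choose the variable name, and the function works for any number of columns.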

After that, you can calculate the percentages with, for example, the dplyr package:

library(dplyr)
dat <- dat %>% mutate(xperc=round(100*x/sum(total),1),
                      yperc=round(100*y/sum(total),1),
                      zperc=round(100*z/sum(total),1),
                      perc=round(100*total/sum(total),1))

which results in:

> dat
  id x y z total xperc yperc zperc perc
1  a 4 4 3    11   6.7   6.7   5.0 18.3
2  b 5 5 2    12   8.3   8.3   3.3 20.0
3  c 5 4 4    13   8.3   6.7   6.7 21.7
4  d 6 4 5    15  10.0   6.7   8.3 25.0
5  e 0 3 5     8   0.0   5.0   8.3 13.3
6  f 0 0 1     1   0.0   0.0   1.7  1.7
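If the column names are not known in advance either, the percentages can be computed for all count columns at once in base R. The sketch below re-enters the count table shown above by hand; the variable names count_cols, grand_total and result are made up for illustration, and the percentage of the total column ends up named totalperc rather than perc.

```r
# Count table copied from the output above
dat <- data.frame(
  id = letters[1:6],
  x = c(4, 5, 5, 6, 0, 0),
  y = c(4, 5, 4, 4, 3, 0),
  z = c(3, 2, 4, 5, 5, 1),
  total = c(11, 12, 13, 15, 8, 1)
)

# Every column except the id holds counts
count_cols <- setdiff(names(dat), "id")
grand_total <- sum(dat$total)

# Divide all count columns by the grand total in one step
perc <- round(100 * dat[count_cols] / grand_total, 1)
names(perc) <- paste0(count_cols, "perc")

result <- cbind(dat, perc)
print(result)
```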
