How do you summarize columns based on unique IDs without knowing IDs in R?

Question

I've been going through the posts on summarizing data, but I haven't found what I'm looking for.

I want to create a summary "count table" that shows how often a certain medication was given to patients. It doesn't matter that some patients received multiple medications simultaneously: I simply want a summary of all the medication given, and then to calculate what percentage each medication class is of all medication given. The issue is that I don't know the names of the possible medications in advance. They're "hidden" somewhere in the data.frame, so I first have to specify which columns R should look through to create a "list", and then summarize the columns by that list.

I suspect this points towards the plyr package, but my attempts to use its functions correctly haven't worked so far.

My df looks like this:

x <- sample(letters[1:4], 20, replace = TRUE)
y <- sample(letters[1:5], 20, replace = TRUE)
z <- sample(letters[1:6], 20, replace = TRUE)
df<-data.frame(x,y,z)
head(df)
  x y z
1 a a f
2 a c d
3 b b e
4 c d b
5 a a b
6 c d d

As you can see, the data.frame contains three columns whose letters partly overlap and partly differ; each letter indicates the name of a medication given.

What I'd now like to do is create a list of unique characters,

unique(x)
unique(y)
unique(z)

which serves as my reference list by which R can then summarize the counts in each column.

summary(df)

returns a summary of counts for each column, but not for each ID itself, and without each ID's percentage of all counts.

I also tried the following, which sort of goes in the right direction, but ideally I'd like to have a list of unique characters that I can feed to the length argument:

library(plyr)
ddply(df, .(x), summarize, counts = length(unique(y)))

Any idea how I could do this? Help is much appreciated.

Recommended answer

If you just want a count for the whole data frame, you can use table(unlist(df)) (see also @goctlr's answer), and if you also want proportions, prop.table(table(unlist(df))). Getting the count for the individual columns as well is more difficult.
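
As a minimal sketch of those two calls, assuming every column of df holds medication codes (the small data frame below is made up for illustration and is not the data from the question):

```r
# Toy data (hypothetical): three columns of medication codes
df <- data.frame(
  x = c("a", "a", "b", "c"),
  y = c("a", "c", "b", "d"),
  z = c("f", "d", "e", "b"),
  stringsAsFactors = FALSE
)

# Flatten all columns into one vector and count each distinct value
counts <- table(unlist(df))

# Express each count as a percentage of all medications given
percent <- round(100 * prop.table(counts), 1)

print(counts)
print(percent)
```

Because unlist() pools the columns before counting, no list of medication names has to be supplied up front; table() discovers the distinct values itself.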

To get the count for each column and the total count, I wrote the following function:

# some reproducible data:
set.seed(1)
x <- sample(letters[1:4], 20, replace = TRUE)
y <- sample(letters[1:5], 20, replace = TRUE)
z <- sample(letters[1:6], 20, replace = TRUE)
df <- data.frame(x,y,z)

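# the function: counts how often each unique value occurs in every column
func <- function(x) {
  x2 <- data.frame()
  nms <- names(x)
  # reference list of all unique values found anywhere in the data frame
  id <- sort(unique(unlist(x)))
  for(i in seq_along(id)) {
    for(j in seq_along(nms)) {
      # number of times value i occurs in column j
      x2[i,j] <- sum(x[,j] %in% id[i])
    }
  }
  names(x2) <- nms
  x2$total <- rowSums(x2)
  x2 <- cbind(id,x2)
  # store the result as 'dat' in the global environment
  assign("dat", x2, envir = .GlobalEnv)
}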

Executing the function with func(df) will give you a data frame dat in your global environment:

> dat
  id x y z total
1  a 4 4 3    11
2  b 5 5 2    12
3  c 5 4 4    13
4  d 6 4 5    15
5  e 0 3 5     8
6  f 0 0 1     1
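
If you would rather have the function return its result than write into the global environment, the same kind of table can be built without explicit loops. The following is only a sketch of one alternative, not the method used in this answer; count_by_column is a hypothetical helper name, and the exact counts you get depend on the sampled data:

```r
set.seed(1)
df <- data.frame(
  x = sample(letters[1:4], 20, replace = TRUE),
  y = sample(letters[1:5], 20, replace = TRUE),
  z = sample(letters[1:6], 20, replace = TRUE),
  stringsAsFactors = FALSE
)

# count_by_column is a hypothetical name, not part of any package
count_by_column <- function(data) {
  # Reference list of every distinct value found anywhere in the data
  ids <- sort(unique(unlist(lapply(data, as.character))))
  # Tabulate each column against the same levels, so that values
  # missing from a column are reported as 0 instead of being dropped
  per_column <- lapply(data, function(col) {
    as.vector(table(factor(col, levels = ids)))
  })
  counts <- do.call(cbind, per_column)
  out <- data.frame(id = ids, counts, stringsAsFactors = FALSE)
  out$total <- rowSums(counts)
  out
}

dat <- count_by_column(df)
print(dat)
```

Fixing the factor levels to the shared reference list is what keeps the rows aligned across columns, which is the role the nested loops play in func above.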

After that, you can calculate the percentages, for example with the dplyr package:

library(dplyr)
dat <- dat %>% mutate(xperc=round(100*x/sum(total),1),
                      yperc=round(100*y/sum(total),1),
                      zperc=round(100*z/sum(total),1),
                      perc=round(100*total/sum(total),1))

which results in:

> dat
  id x y z total xperc yperc zperc perc
1  a 4 4 3    11   6.7   6.7   5.0 18.3
2  b 5 5 2    12   8.3   8.3   3.3 20.0
3  c 5 4 4    13   8.3   6.7   6.7 21.7
4  d 6 4 5    15  10.0   6.7   8.3 25.0
5  e 0 3 5     8   0.0   5.0   8.3 13.3
6  f 0 0 1     1   0.0   0.0   1.7  1.7
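
When there are many medication columns, writing one mutate() term per column becomes tedious. As a hedged base R sketch, all percentage columns can be computed in one step; the counts below are copied from the dat table shown earlier, and the "_perc" suffix is my own naming choice rather than the column names used above:

```r
# Counts copied from the dat table produced by func(df)
dat <- data.frame(
  id = c("a", "b", "c", "d", "e", "f"),
  x = c(4, 5, 5, 6, 0, 0),
  y = c(4, 5, 4, 4, 3, 0),
  z = c(3, 2, 4, 5, 5, 1)
)
dat$total <- rowSums(dat[, c("x", "y", "z")])

# Columns to convert; every share is relative to the grand total
count_cols <- c("x", "y", "z", "total")
grand_total <- sum(dat$total)

# Divide all count columns at once and label them with a "_perc" suffix
perc <- round(100 * dat[count_cols] / grand_total, 1)
names(perc) <- paste0(count_cols, "_perc")

dat <- cbind(dat, perc)
print(dat)
```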
