Tabulate responses for multiple columns by grouping variable with dplyr


Problem description

Hi: I'm new to the plyr/dplyr family but enjoying it. I can see its massive utility for my own work, but I'm still trying to get my head around it.
I have a data frame that looks like the one below.

1) How do I produce a table for each non-grouping variable that shows the distribution of responses within each value of the grouping variable?

2) Note: I do have some missing values and I would like to exclude them from the tabulation. I realize the summarize_each command will apply the function to each column, but I don't know how to handle the missing values in a simple way. I have seen some code suggesting that you have to filter out missing values, but what if the missing values are scattered randomly through the non-grouping variables?

3) Fundamentally, is it best to just use complete cases with dplyr?

#library
library(dplyr)
#sample data
group <- sample(c('A', 'B', 'C'), 100, replace=TRUE)
var1 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var2 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var3 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
df <- data.frame(group, var1, var2, var3)
#my code
out_df <- df %>% group_by(group)
out_df %>% summarise_each(funs(table))

Solution

You can get counts by group for each of var1, var2, and var3 if you "melt" your data frame into long form first, which will "stack" the three var columns into a single column (value) and then create an additional column (variable) marking which rows go with which var.

library(dplyr)
library(reshape2)

#sample data
group <- sample(c('A', 'B', 'C'), 100, replace=TRUE)
var1 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var2 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var3 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))

df <- data.frame(group, var1, var2, var3)

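out_df <- df %>%
  melt(id.vars="group") %>%              # Stack var1..var3 into variable/value columns
  filter(!is.na(value)) %>%              # Remove NA
  group_by(group, variable, value) %>%
  summarise(count=n()) %>%               # Rows per group/variable/value combination
  group_by(group, variable) %>%
  mutate(percent=count/sum(count))       # Share within each group/variable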

You can stop the function chain at any point to look at the intermediate steps, which will help in understanding what each step is doing.
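As a rough illustration of stopping the chain early, the sketch below runs only the first one or two steps and inspects the result. The `set.seed()` call and the name `long_df` are assumptions added here for reproducibility and readability; they are not part of the original answer.

```r
library(dplyr)
library(reshape2)

set.seed(1)  # assumption: fixed seed so the illustration is reproducible
p <- c(rep(0.15, 5), 0.25)
df <- data.frame(
  group = sample(c("A", "B", "C"), 100, replace = TRUE),
  var1  = sample(c(1:5, NA), 100, replace = TRUE, prob = p),
  var2  = sample(c(1:5, NA), 100, replace = TRUE, prob = p),
  var3  = sample(c(1:5, NA), 100, replace = TRUE, prob = p)
)

# Stop after the first step: one row per (original row, var column) pair
long_df <- df %>% melt(id.vars = "group")
head(long_df)
dim(long_df)  # 100 rows x 3 var columns = 300 rows; columns group, variable, value

# Stop after the second step: rows with a missing response are dropped
n_kept <- long_df %>% filter(!is.na(value)) %>% nrow()
n_kept
```

Because the filter runs on the long data, a missing value in one var column removes only that one response, not the whole original row.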

Because we grouped by group, variable, and value, count gives the number of rows for each combination of those three columns. We then group only by group and variable, so that percent is each value's share of the non-missing responses within that combination of group and variable (a proportion between 0 and 1, despite the name). The second group_by is not strictly necessary: after summarise, dplyr drops the last grouping variable, since only one row remains for each combination of the original grouping variables. I still prefer to regroup explicitly.
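The grouping behaviour described above can be checked directly. The following is a minimal sketch, assuming a dplyr version that provides `group_vars()`; the seed and the names `counts`, `result`, and `check` are illustrative only.

```r
library(dplyr)
library(reshape2)

set.seed(42)  # assumption: fixed seed for a reproducible illustration
p <- c(rep(0.15, 5), 0.25)
df <- data.frame(
  group = sample(c("A", "B", "C"), 100, replace = TRUE),
  var1  = sample(c(1:5, NA), 100, replace = TRUE, prob = p),
  var2  = sample(c(1:5, NA), 100, replace = TRUE, prob = p),
  var3  = sample(c(1:5, NA), 100, replace = TRUE, prob = p)
)

counts <- df %>%
  melt(id.vars = "group") %>%
  filter(!is.na(value)) %>%
  group_by(group, variable, value) %>%
  summarise(count = n())

# After summarise, the last grouping variable (value) has been dropped
group_vars(counts)

# So mutate already works within each group/variable combination
result <- counts %>% mutate(percent = count / sum(count))

# The proportions within each group/variable combination should sum to 1
check <- result %>% summarise(total = sum(percent))
check
```

Recent dplyr versions print a message about the resulting grouping after `summarise()`; regrouping explicitly, as the answer does, makes the intent clear either way.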

Here's the final result:

out_df

   group variable value count    percent
1      A     var1     1     6 0.26086957
2      A     var1     2     3 0.13043478
3      A     var1     3     6 0.26086957
4      A     var1     4     1 0.04347826
5      A     var1     5     7 0.30434783
...
41     C     var3     1     6 0.25000000
42     C     var3     2     5 0.20833333
43     C     var3     3     4 0.16666667
44     C     var3     4     2 0.08333333
45     C     var3     5     7 0.29166667
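If one cross-tabulation per variable is preferred over a long data frame, base R's `table()` may be enough, since it leaves out NA values by default. This is an alternative sketch and not the method from the answer above; the seed and the names `vars`, `tabs`, and `props` are assumptions made for illustration.

```r
set.seed(7)  # assumption: fixed seed for a reproducible illustration
p <- c(rep(0.15, 5), 0.25)
df <- data.frame(
  group = sample(c("A", "B", "C"), 100, replace = TRUE),
  var1  = sample(c(1:5, NA), 100, replace = TRUE, prob = p),
  var2  = sample(c(1:5, NA), 100, replace = TRUE, prob = p),
  var3  = sample(c(1:5, NA), 100, replace = TRUE, prob = p)
)

vars <- c("var1", "var2", "var3")

# One group-by-value count table per variable; NA responses are excluded
tabs <- lapply(df[vars], function(x) table(group = df$group, value = x))

# Row-wise proportions: the distribution of responses within each group
props <- lapply(tabs, prop.table, margin = 1)

tabs$var1
round(props$var1, 3)
```

Each variable is tabulated on its own non-missing responses, so a row with an NA in var1 still counts toward the var2 and var3 tables. Filtering to complete cases first would instead discard that row everywhere.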
