Tabulate responses for multiple columns by grouping variable with dplyr
Hi: I'm new to the plyr/dplyr family but enjoying it. I can see its massive utility for my own work, but I'm still trying to get my head around it.
I have a data frame that looks like below.
1) How do I produce a table for each non-grouping variable that shows the distribution of responses within each value of the grouping variable?
2) Note: I do have some missing values and I would like to exclude them from the tabulation. I realize the summarise_each function will apply a function to each column, but I don't know how to handle the missing values in a simple way. I have seen some code that suggests you have to filter out missing values, but what if the missing values are scattered randomly through the non-grouping variables?
3) Fundamentally, is it best to just use complete cases with dplyr?
#library
library(dplyr)
#sample data
group<-sample(c('A', 'B', 'C'), 100, replace=TRUE)
var1<-sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var2<-sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var3<-sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
df<-data.frame(group, var1, var2, var3)
#my code
out_df<-df %>%group_by(group)
out_df %>% summarise_each(funs(table))
You can get counts by group for each of var1, var2, and var3 if you "melt" your data frame into long form first, which will "stack" the three var columns into a single column (value) and then create an additional column (variable) marking which rows go with which var.
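To see what melting does on its own, here is a minimal sketch on a tiny made-up data frame (the toy values are illustrative and not part of the question's data). It assumes reshape2 is installed.

```r
library(reshape2)

# Tiny made-up data frame (values are illustrative only)
toy <- data.frame(
  group = c("A", "B"),
  var1  = c(1, NA),
  var2  = c(3, 4)
)

# melt() stacks var1 and var2 into a single `value` column and records
# the source column name in `variable`; `group` is repeated as the id
long <- melt(toy, id.vars = "group")
long
```

Each original row becomes one row per var column, so the two-row toy frame turns into four rows, and the single NA is still there until it is filtered out.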
library(dplyr)
library(reshape2)
#sample data
group <- sample(c('A', 'B', 'C'), 100, replace=TRUE)
var1 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var2 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
var3 <- sample(c(1,2,3,4,5,NA), 100, replace=TRUE, prob=c(0.15,0.15,0.15,0.15,0.15,0.25))
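df <- data.frame(group, var1, var2, var3)
out_df <- df %>%
  melt(id.vars="group") %>%                # Long form: one row per group/variable pair
  filter(!is.na(value)) %>%                # Remove NA
  group_by(group, variable, value) %>%
  summarise(count=n()) %>%                 # Rows per group/variable/value combination
  group_by(group, variable) %>%
  mutate(percent=count/sum(count))         # Share within each group/variable pair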
You can stop the function chain at any point to look at the intermediate steps, which will help in understanding what each step is doing.
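As a rough illustration of stopping partway, the sketch below cuts the chain after melt() and again after filter(). The seed, the smaller sample size, and the names step1 and step2 are assumptions made for this example only. It also shows that filtering after melting drops individual missing answers rather than whole respondents, which is what a complete-cases approach would do.

```r
library(dplyr)
library(reshape2)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
df <- data.frame(
  group = sample(c("A", "B", "C"), 20, replace = TRUE),
  var1  = sample(c(1:5, NA), 20, replace = TRUE),
  var2  = sample(c(1:5, NA), 20, replace = TRUE)
)

# Stop after melt(): every (row, var) pair is present, NAs included
step1 <- df %>% melt(id.vars = "group")

# Stop after filter(): only the individual missing answers are gone
step2 <- step1 %>% filter(!is.na(value))

nrow(step1)              # 20 rows x 2 var columns
nrow(step2)              # step1 minus the number of NA answers
sum(complete.cases(df))  # respondents with no NA in any column
```

A respondent who skipped var1 but answered var2 still contributes to the var2 tabulation here, whereas restricting to complete cases first would discard that respondent entirely.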
Because we grouped by group, variable, and value, we end up with count giving us the number of rows for each combination of those three columns. Then we group only by group and variable to calculate the percentage of rows that each count contributes within each combination of the two grouping variables. (The second group_by is not essential, because dplyr drops the last grouping variable after a summarise operation, since there will only be one row for each combination of all the original grouping variables, but I prefer to regroup explicitly.)
Here's the final result:
out_df
   group variable value count    percent
1      A     var1     1     6 0.26086957
2      A     var1     2     3 0.13043478
3      A     var1     3     6 0.26086957
4      A     var1     4     1 0.04347826
5      A     var1     5     7 0.30434783
...
41     C     var3     1     6 0.25000000
42     C     var3     2     5 0.20833333
43     C     var3     3     4 0.16666667
44     C     var3     4     2 0.08333333
45     C     var3     5     7 0.29166667
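For readers on a newer tidyverse, the same table can probably be produced without reshape2. The sketch below swaps in tidyr's pivot_longer() and dplyr's count() in place of melt() and summarise(); it is an alternative under the assumption that tidyr 1.0 or later is installed, not the method used in the answer above. The seed and the name out_df2 are made up for the example, so the numbers will not match the output shown.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # arbitrary seed for repeatability
df <- data.frame(
  group = sample(c("A", "B", "C"), 100, replace = TRUE),
  var1  = sample(c(1:5, NA), 100, replace = TRUE),
  var2  = sample(c(1:5, NA), 100, replace = TRUE),
  var3  = sample(c(1:5, NA), 100, replace = TRUE)
)

out_df2 <- df %>%
  # Stack var1..var3 into long form, dropping missing answers as we go
  pivot_longer(-group, names_to = "variable", values_to = "value",
               values_drop_na = TRUE) %>%
  # One row per group/variable/value combination with its row count
  count(group, variable, value, name = "count") %>%
  # Share of each value within its group/variable pair
  group_by(group, variable) %>%
  mutate(percent = count / sum(count)) %>%
  ungroup()

out_df2
```

Within each group and variable the percent column sums to 1, the same as in the reshape2 version.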