How do I calculate a grouped z-score in R using dplyr?


Question

Using the iris dataset, I'm trying to calculate a z-score for each of the variables. I put the data into tidy (long) format as follows:

library(reshape2)
library(dplyr)
# Reshape to long format: one row per (Species, variable, value)
test <- melt(iris, id.vars = 'Species')

This gives me the following:

  Species     variable value
1  setosa Sepal.Length   5.1
2  setosa Sepal.Length   4.9
3  setosa Sepal.Length   4.7
4  setosa Sepal.Length   4.6
5  setosa Sepal.Length   5.0
6  setosa Sepal.Length   5.4

But I run into trouble when I try to create a z-score column for each group (for example, the z-score for Sepal.Length should not be comparable to that of Sepal.Width) using the following:

test <- test %>% 
  group_by(Species, variable) %>% 
  mutate(z_score = (value - mean(value)) / sd(value))

The resulting z-scores do not appear to be grouped; they seem to be based on all of the data.

What's the best way to return the z-scores by group using dplyr?

Many thanks!

Recommended answer

Your code does give you z-scores by group. It seems to me these z-scores are comparable precisely because each group has been individually scaled to mean = 0 and sd = 1, rather than each value being scaled by the mean and sd of the full data frame. For example:

library(tidyverse)

First, set up the melted data frame:

dat = iris %>% 
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>% 
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%   # You can also use scale(value) as pointed out by @RuiBarradas
  ungroup %>% 
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value)) 
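
The comment about scale() can be checked directly. The following is a minimal sketch, not part of the original answer: it assumes tidyr's pivot_longer() in place of gather(), and the names long, cmp, z_manual and z_scale are made up for illustration. It computes the grouped z-score both by hand and with scale(), then compares the two columns.

```r
library(dplyr)
library(tidyr)

# Long format: one row per (Species, variable, value).
# pivot_longer() is used here as an assumed stand-in for gather().
long <- iris %>%
  pivot_longer(-Species, names_to = "variable", values_to = "value")

cmp <- long %>%
  group_by(Species, variable) %>%
  mutate(
    # Manual z-score within each Species/variable group
    z_manual = (value - mean(value)) / sd(value),
    # scale() returns a one-column matrix; as.numeric() turns it into a plain vector
    z_scale  = as.numeric(scale(value))
  ) %>%
  ungroup()

# Compare the two columns up to floating-point tolerance
all.equal(cmp$z_manual, cmp$z_scale)
```

Wrapping scale() in as.numeric() is a design choice: without it, the new column holds a matrix, which can print oddly and confuse later steps.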

Now look at the first three rows and compare them with a direct calculation:

head(dat, 3)

#   Species     variable value z_score_group z_score_ungrouped
# 1  setosa Sepal.Length   5.1     0.2666745         0.8278959
# 2  setosa Sepal.Length   4.9    -0.3007180         0.7266552
# 3  setosa Sepal.Length   4.7    -0.8681105         0.6254145

# z-scores by group
with(dat, (value[1:3] - mean(value[Species=="setosa" & variable=="Sepal.Length"])) / sd(value[Species=="setosa" & variable=="Sepal.Length"]))

# [1]  0.2666745 -0.3007180 -0.8681105

# ungrouped z-scores
with(dat, (value[1:3] - mean(value)) / sd(value))

# [1] 0.8278959 0.7266552 0.6254145
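
Another way to confirm that the grouping took effect is to summarise the z-scores per group: if each Species/variable combination was scaled on its own, every group should have a mean of roughly 0 and an sd of roughly 1. This is a sketch under the same assumptions as before (pivot_longer() instead of gather(); check, mean_z and sd_z are illustrative names, not from the original answer).

```r
library(dplyr)
library(tidyr)

check <- iris %>%
  pivot_longer(-Species, names_to = "variable", values_to = "value") %>%
  group_by(Species, variable) %>%
  # Same grouped z-score as in the question
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%
  # One row per group: size, mean and sd of the z-scores
  summarise(
    n      = n(),
    mean_z = mean(z_score_group),
    sd_z   = sd(z_score_group),
    .groups = "drop"
  )

check
```

If the means came out clearly non-zero, or the sds clearly different from 1, that would suggest the grouping was ignored somewhere in the pipeline.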

Now visualize the z-scores. The first graph produced by the code below is the raw data. The second is the ungrouped z-scores: we've just rescaled the data to an overall mean = 0 and SD = 1. The third graph is what your code produces: each group has been individually scaled to mean = 0 and SD = 1.

gridExtra::grid.arrange(
  grobs=setNames(names(dat)[c(3,5,4)], names(dat)[c(3,5,4)]) %>% 
    map(~ ggplot(dat %>% mutate(group=paste(Species,variable,sep="_")), 
                 aes_string(.x, colour="group")) + geom_density()),
  ncol=1)
