Variable results with dplyr summarise, depending on output variable naming

Question

I'm using the dplyr package (dplyr 0.4.3; R 3.2.3) for a basic summary of grouped data (summarise), but I get inconsistent results (NaN for 'sd' and an incorrect count for 'n'). Changing the name of the output column has variable effects (examples below).

Summary of findings so far:

  • The plyr package is not loaded, which I know could cause problems with dplyr if loaded first.
  • The same results are obtained with or without NA data (not shown).
  • The problem can be fixed by using camelCase variable naming (not shown) or by using an output variable name without a non-alphanumeric separator.
  • Valid results are still obtained for some combinations of "." or "_" in the output column names.

Question: Although this problem can be worked around, am I violating a basic variable-naming rule, or is there a program issue that needs to be addressed? I've seen other questions about variable behavior with summarise, but not quite this.

Thanks, Matt

Example data:

library(dplyr)
df<-data_frame(id=c(1,1,1,2,2,2,3,3,3),
       time=rep(1:3, 3),
       glucose=c(90,150, 200,
                 100,150,200,
                 80,100,150))

Example: sd gives NaN and an inaccurate n

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose.sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

I wondered whether the issue was using "." in the name, or using the same name as a column in the data frame. Removing existing data frame column names from the output fixes this:

df %>% group_by(time) %>%
  summarise(avg=mean(glucose, na.rm=TRUE),
        stdv=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time      avg     stdv     n
  (int)    (dbl)    (dbl) (int)
1     1  90.0000 10.00000     3
2     2 133.3333 28.86751     3
3     3 183.3333 28.86751     3

Removing the "glucose" summary fixes it too, even though "glucose.sd" is left. Example: after removing "glucose", the result is OK

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose.sd     n
  (int)      (dbl) (int)
1     1   10.00000     3
2     2   28.86751     3
3     3   28.86751     3

If I use "glucose.mean" as the name of the first summary, it works fine

df %>% group_by(time) %>%
  summarise(glucose.mean=mean(glucose, na.rm=TRUE),
            glucose.sd=sd(glucose, na.rm=TRUE),
            n=sum(!is.na(glucose)))

   time glucose.mean glucose.sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3

The same error occurs when using a variable name without ".", so it's not just an issue with using "." in the name

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose_sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

Renaming "glucose" to "glucose_mean" works

df %>% group_by(time) %>%
  summarise(glucose_mean=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose_mean glucose_sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3

Recommended answer

The transformations you specify in summarize are performed in the order they appear, which means that if you change a variable's values, those new values are what the subsequent columns see (this is different from the base function transform()). When you do

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

The glucose=mean(glucose, na.rm=TRUE) part has changed the value of the glucose variable, so that when the glucose.sd=sd(glucose, na.rm=TRUE) part is calculated, sd() does not see the original glucose values; it sees the new value, which is the mean of the original values. That single value per group is why the count is 1 and the standard deviation is undefined. If you re-order the columns, it will work.

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)), 
        glucose=mean(glucose, na.rm=TRUE))
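To make the mechanism concrete, the sequential evaluation can be imitated in base R for a single group. This is only a sketch of what each later expression ends up seeing, not a description of how dplyr is implemented; the names `g`, `glucose_sd`, and `n` are chosen here for illustration.

```r
# Glucose values for the time == 1 group from the example data
g <- c(90, 100, 80)

# Step 1: the first summary overwrites `glucose` with its mean,
# so only a single value remains for this group
glucose <- mean(g, na.rm = TRUE)

# Step 2: later expressions now see that single value
glucose_sd <- sd(glucose, na.rm = TRUE)  # undefined for a single value
n <- sum(!is.na(glucose))                # counts the one mean, not the three readings

print(glucose)     # 90
print(glucose_sd)  # NA in base R
print(n)           # 1
```

Base R's sd() returns NA for a single value; in the question's output the same undefined result shows up as NaN.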

If you are wondering why this is the default behavior, it is because it is often convenient to create a column and then use that column's values later in the same set of transformations. For example, with mutate()

df %>% group_by(time) %>%
  mutate(glucose_sq = glucose^2,
        glucose_sq_plus2 = glucose_sq+2)
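For contrast, base transform() evaluates all of its expressions against the original data frame, so a column created or replaced in the same call is not visible to the later expressions. A small base-R sketch, using a cut-down stand-in data frame (the name `d` is hypothetical):

```r
# A small stand-in for one group of the example data (hypothetical name `d`)
d <- data.frame(glucose = c(90, 100, 80))

# Both expressions see the original glucose column, so the sd is computed
# from the raw readings even though `glucose` is also being replaced
res <- transform(d, glucose = mean(glucose), glucose_sd = sd(glucose))
print(res)

# Referring to a column created in the same call fails, because
# `glucose_sq` does not exist yet when the expressions are evaluated
err <- tryCatch(
  transform(d, glucose_sq = glucose^2, glucose_sq_plus2 = glucose_sq + 2),
  error = function(e) conditionMessage(e)
)
print(err)
```

The trade-off runs both ways: sequential evaluation allows chained definitions like the mutate() call above, but it also means that reusing an input column's name for an output changes what every later expression sees.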
