Preserve order of input variables and factor levels in summary table, using dplyr and tidyr


Question

I love how easy dplyr and tidyr make it to create a single summary table with multiple predictor and outcome variables. The one step that stumped me was the last: preserving (or defining) the order of the predictor variables, and of their factor levels, in the output table.

I've come up with a solution of sorts (below): use mutate to manually build a factor that combines the predictor name and the predictor value (e.g. "gender_female"), with its levels listed in the desired output order. But this gets long-winded when there are many variables, and I wonder whether there is a better way.

library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

set.seed(1234)

dat <- data.frame(
  gender    = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1  = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2  = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  # Statement below creates variable for ordering output
  mutate(
    pred_ord = factor(interaction(predictor, addNA(pred_value), sep = "_"),
                      levels = c(paste("gender", levels(addNA(dat$gender)), sep = "_"),
                                 paste("ethnicity", levels(addNA(dat$ethnicity)), sep = "_")))
  ) %>%
  group_by(pred_ord, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(key = outcome, value = n) %>%
  separate(pred_ord, c("Predictor", "Pred_value"))

Source: local data frame [9 x 4]

  Predictor Pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1    gender     Female       25       27
2    gender       Male       11       10
3    gender    Unknown       12       15
4 ethnicity      Maori       10        9
5 ethnicity    Pacific        7        7
6 ethnicity      Asian        6       12
7 ethnicity      Other       10        9
8 ethnicity   European        5        4
9 ethnicity    Unknown       10       11
Warning message:
attributes are not identical across measure variables; they will be dropped 

The table above is correct in that neither the predictors nor the predictor values have been re-sorted alphabetically.

Edit

As requested, this is what the default (alphabetical) ordering produces. It makes sense: when the two factors are gathered into one column they are converted to a character variable, and all attributes, including the level order, are dropped.

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n)

Source: local data frame [9 x 4]

  predictor pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1 ethnicity      Asian        6       12
2 ethnicity   European        5        4
3 ethnicity      Maori       10        9
4 ethnicity      Other       10        9
5 ethnicity    Pacific        7        7
6 ethnicity    Unknown       10       11
7    gender     Female       25       27
8    gender       Male       11       10
9    gender    Unknown       12       15
Warning message:
attributes are not identical across measure variables; they will be dropped 
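
To see why the rows come out alphabetically, here is a minimal base R sketch (not part of the original question; the vector x is made up for illustration). Once the values are plain character strings they sort alphabetically, whereas a factor sorts by the order of its levels:

```r
# Level order taken from the question
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")

# Hypothetical values, standing in for a column after gather()
x <- c("Other", "Asian", "Unknown", "Maori")

# Character vectors sort alphabetically
chr_sorted <- sort(x)
print(chr_sorted)
# "Asian" "Maori" "Other" "Unknown"

# Factors sort by the position of each value in `levels`
fct_sorted <- as.character(sort(factor(x, levels = levels_eth)))
print(fct_sorted)
# "Maori" "Asian" "Other" "Unknown"
```

This is the same effect seen in the table above: group_by and summarise order character columns alphabetically because the level information is gone.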


Recommended answer

If you want your data arranged in factor order, you'll need to convert the columns back to factors, because gather coerces them to character (which it warns you about). gather's factor_key parameter takes care of predictor, but you'll need to assemble the levels for pred_value yourself, as it now combines two factors from the original data. Simplifying a bit:

library(tidyr)
library(dplyr)

dat %>% 
    gather(key = predictor, value = pred_value, gender, ethnicity, factor_key = TRUE) %>%
    group_by(predictor, pred_value) %>% 
    summarise_all(sum) %>%
    ungroup() %>% 
    mutate(pred_value = factor(pred_value, levels = unique(c(levels_eth, levels_gnd), 
                                                           fromLast = TRUE))) %>% 
    arrange(predictor, pred_value)

## # A tibble: 9 × 4
##   predictor pred_value outcome1 outcome2
##      <fctr>     <fctr>    <int>    <int>
## 1    gender     Female       25       27
## 2    gender       Male       11       10
## 3    gender    Unknown       12       15
## 4 ethnicity      Maori       10        9
## 5 ethnicity    Pacific        7        7
## 6 ethnicity      Asian        6       12
## 7 ethnicity      Other       10        9
## 8 ethnicity   European        5        4
## 9 ethnicity    Unknown       10       11

Note that you'll need to use unique with fromLast = TRUE so that the duplicated "Unknown" collapses to a single occurrence in the right place (at the end); union keeps the first occurrence instead and so puts it earlier.
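
The difference is easy to check in base R. The sketch below reuses levels_eth and levels_gnd from the question; the vector g is a made-up example used only to show the effect on sorting:

```r
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

# union() keeps the FIRST "Unknown", so it lands before "Female"
lv_union <- union(levels_eth, levels_gnd)
print(lv_union)
# "Maori" "Pacific" "Asian" "Other" "European" "Unknown" "Female" "Male"

# unique(..., fromLast = TRUE) keeps the LAST "Unknown", so it lands at the end
lv_last <- unique(c(levels_eth, levels_gnd), fromLast = TRUE)
print(lv_last)
# "Maori" "Pacific" "Asian" "Other" "European" "Female" "Male" "Unknown"

# Effect on the gender rows (hypothetical values for illustration)
g <- c("Unknown", "Male", "Female")
g_union <- as.character(sort(factor(g, levels = lv_union)))
g_last  <- as.character(sort(factor(g, levels = lv_last)))
print(g_union)  # "Unknown" "Female" "Male"
print(g_last)   # "Female" "Male" "Unknown"
```

A single combined level vector works here because "Unknown" is last in both factors. If two predictors shared levels in conflicting positions, one combined factor could not satisfy both orders, and a per-predictor ordering would be needed instead.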
