Unnesting tibble columns: "Wide" data summaries with dplyr v1.0.0

Question

I'd like to produce "wide" summary tables of data in this sort of format:

                                   ----   Centiles  ----
Param    Group   Mean       SD      25%     50%      75%
Height       1   x.xx    x.xxx     x.xx    x.xx     x.xx
             2   x.xx    x.xxx     x.xx    x.xx     x.xx
             3   x.xx    x.xxx     x.xx    x.xx     x.xx
Weight       1   x.xx    x.xxx     x.xx    x.xx     x.xx
             2   x.xx    x.xxx     x.xx    x.xx     x.xx
             3   x.xx    x.xxx     x.xx    x.xx     x.xx

I can do that in dplyr 0.8.x. I can do it generically, with a function that can handle arbitrary grouping variables with arbitrary numbers of levels and arbitrary statistics summarising arbitrary numbers of variables with arbitrary names. I get that level of flexibility by making my data tidy. That's not what this question is about.

First, some toy data:

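library(tidyverse)

set.seed(123456)

# Two measurements per subject, reshaped so each row is one (Group, Parameter, Value)
toy <- tibble(
  Group=rep(1:3, each=5),
  Height=1.65 + rnorm(15, 0, 0.1),
  Weight= 75 + rnorm(15, 0, 10)
) %>% 
  pivot_longer(
    values_to="Value", 
    names_to="Parameter",
    cols=c(Height, Weight)
  )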

Now, a simple summary function, and a helper:

quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
  tibble(Value := quantile(x, q), "Quantile" := q)
}

mySummary <- function(data, ...) {
  data %>% 
    group_by(Parameter, Group) %>% 
    summarise(..., .groups="drop")
}

So I can say

summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>% head()

giving

# A tibble: 6 x 5
  Parameter Group Q$Value $Quantile  Mean     SD
  <chr>     <int>   <dbl>     <dbl> <dbl>  <dbl>
1 Height        1    1.45      0.25  1.54 0.141 
2 Height        1    1.49      0.5   1.54 0.141 
3 Height        1    1.59      0.75  1.54 0.141 
4 Height        2    1.64      0.25  1.66 0.0649
5 Height        2    1.68      0.5   1.66 0.0649
6 Height        2    1.68      0.75  1.66 0.0649

So that's the summary I need, but it's in long format. And Q is a df-col. It's a tibble:

is_tibble(summary$Q)
[1] TRUE
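
For readers who have not met the term, a df-col is a column that is itself a data frame. A minimal sketch with made-up values (not the toy data above; the names df and id are purely illustrative):

```r
library(tibble)

# A tibble whose second column, Q, is itself a two-column tibble (a df-col)
df <- tibble(
  id = 1:2,
  Q  = tibble(Value = c(1.5, 2.5), Quantile = c(0.25, 0.75))
)

ncol(df)          # the outer tibble still counts Q as a single column
is_tibble(df$Q)   # the column itself is a tibble
names(df$Q)       # with its own inner column names
```

This is why the printout above shows a header like Q$Value and $Quantile: the two inner columns live inside the single outer column Q.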

So pivot_wider doesn't seem to work. I can use nest_by() to get to a one-row-per-group format:

toySummary <- summary %>% nest_by(Group, Mean, SD)
toySummary
# Rowwise:  Group, Mean, SD
  Group  Mean      SD               data
  <int> <dbl>   <dbl> <list<tbl_df[,2]>>
1     1  1.54  0.141             [3 × 2]
2     1 78.8  10.2               [3 × 2]
3     2  1.66  0.0649            [3 × 2]
4     2 82.9   9.09              [3 × 2]
5     3  1.63  0.100             [3 × 2]
6     3 71.0  10.8               [3 × 2]

But now the format of the centiles is even more complicated:

> toySummary$data[1]
<list_of<
  tbl_df<
    Parameter: character
    Q        : 
      tbl_df<
        Value   : double
        Quantile: double
      >
  >
>[1]>
[[1]]
# A tibble: 3 x 2
  Parameter Q$Value $Quantile
  <chr>       <dbl>     <dbl>
1 Height       1.45      0.25
2 Height       1.49      0.5 
3 Height       1.59      0.75

It looks like a list, so I guess some form of lapply would probably work, but is there a neater, tidy solution that I've not spotted yet? I've discovered several new verbs that I didn't know about whilst researching this question (chop, pack, rowwise(), nest_by and such), but none seem to give me what I want: ideally, a tibble with 6 rows (defined by unique Group and Parameter combinations) and columns for Mean, SD, Q25, Q50 and Q75.

To clarify in response to the first two proposed answers: getting the exact numbers that my toy example generates is less important than finding a generic technique for moving from the df-col(s) that summarise returns in dplyr v1.0.0 to a wide data summary of the general form that my example illustrates.

Recommended answer

Revised answer

Here is my revised answer. This time, I rewrote your quibble2 function with enframe and pivot_wider so that it returns a one-row tibble with three columns, one per quantile.

This will again lead to a df-col in your summary tibble, and now we can use unpack directly, without using pivot_wider to get the expected outcome.

library(tidyverse)

set.seed(123456)

toy <- tibble(
  Group=rep(1:3, each=5),
  Height=1.65 + rnorm(15, 0, 0.1),
  Weight= 75 + rnorm(15, 0, 10)
) %>% 
  pivot_longer(
    values_to="Value", 
    names_to="Parameter",
    cols=c(Height, Weight)
  )

quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
  pivot_wider(enframe(quantile(x, q)),
              names_from = name,
              values_from = value) 
}

mySummary <- function(data, ...) {
  data %>% 
    group_by(Parameter, Group) %>% 
    summarise(..., .groups="drop")
}

summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))

summary %>% 
  unpack(Q)
#> # A tibble: 6 x 7
#>   Parameter Group `25%` `50%` `75%`  Mean    SD
#>   <chr>     <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height        1  1.62  1.66  1.73  1.70 0.108
#> 2 Height        2  1.73  1.77  1.78  1.76 0.105
#> 3 Height        3  1.55  1.64  1.76  1.65 0.109
#> 4 Weight        1 75.6  80.6  84.3  80.0  9.05 
#> 5 Weight        2 75.4  76.9  79.6  77.4  7.27 
#> 6 Weight        3 70.7  75.2  82.0  76.3  6.94
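
As a possible variation on the above, and only as a sketch: since dplyr 1.0.0, a data frame returned by an unnamed expression inside summarise has its columns spliced straight into the result, so the separate unpack step can be skipped. The helper quantile_row and the data toy_long below are hypothetical names introduced for illustration; the helper also renames the quantiles to Q25, Q50 and Q75, as the question asked.

```r
library(dplyr)
library(tibble)

set.seed(1)

# Small made-up data set in the same long shape as `toy`
toy_long <- tibble(
  Parameter = rep(c("Height", "Weight"), each = 6),
  Group     = rep(rep(1:2, each = 3), times = 2),
  Value     = rnorm(12)
)

# Hypothetical helper: returns a one-row tibble with columns Q25, Q50, Q75
quantile_row <- function(x, q = c(0.25, 0.5, 0.75)) {
  as_tibble_row(setNames(quantile(x, q, na.rm = TRUE), paste0("Q", 100 * q)))
}

wide <- toy_long %>%
  group_by(Parameter, Group) %>%
  summarise(
    Mean = mean(Value),
    SD   = sd(Value),
    quantile_row(Value),  # unnamed, so its columns are added directly
    .groups = "drop"
  )

names(wide)
```

The trade-off is that an unnamed expression gives no prefix to the new columns, so the helper has to produce the final column names itself.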

Created on 2020-06-13 by the reprex package (v0.3.0)

Second approach

Without changing quibble2, we would need to first call unpack and then pivot_wider. This should scale as well.

library(tidyverse)

set.seed(123456)

toy <- tibble(
  Group=rep(1:3, each=5),
  Height=1.65 + rnorm(15, 0, 0.1),
  Weight= 75 + rnorm(15, 0, 10)
) %>% 
  pivot_longer(
    values_to="Value", 
    names_to="Parameter",
    cols=c(Height, Weight)
  )

quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
  tibble(Value := quantile(x, q), "Quantile" := q)
}

mySummary <- function(data, ...) {
  data %>% 
    group_by(Parameter, Group) %>% 
    summarise(..., .groups="drop")
}

summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))

summary %>% 
  unpack(Q) %>% 
  pivot_wider(names_from = Quantile, values_from = Value)
#> # A tibble: 6 x 7
#>   Parameter Group  Mean    SD `0.25` `0.5` `0.75`
#>   <chr>     <int> <dbl> <dbl>  <dbl> <dbl>  <dbl>
#> 1 Height        1  1.70 0.108   1.62  1.66   1.73
#> 2 Height        2  1.76 0.105   1.73  1.77   1.78
#> 3 Height        3  1.65 0.109   1.55  1.64   1.76
#> 4 Weight        1 80.0  9.05   75.6  80.6   84.3 
#> 5 Weight        2 77.4  7.27   75.4  76.9   79.6 
#> 6 Weight        3 76.3  6.94   70.7  75.2   82.0
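
If the column names Q25, Q50 and Q75 from the question are wanted rather than `0.25`, `0.5` and `0.75`, one option is to relabel the Quantile column before pivoting. The sketch below uses a small hand-made table (the name long and all of its values are made up) that stands in for the result of unpack(Q):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up long-format rows standing in for the output of unpack(Q)
long <- tibble(
  Parameter = rep(c("Height", "Weight"), each = 3),
  Group     = 1L,
  Mean      = rep(c(1.7, 80), each = 3),
  Value     = c(1.6, 1.7, 1.8, 75, 80, 85),
  Quantile  = rep(c(0.25, 0.5, 0.75), times = 2)
)

wide_q <- long %>%
  mutate(Quantile = paste0("Q", 100 * Quantile)) %>%  # 0.25 becomes "Q25", etc.
  pivot_wider(names_from = Quantile, values_from = Value)

names(wide_q)
```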

Created on 2020-06-13 by the reprex package (v0.3.0)

Generalized approach

I tried to figure out a more general approach by rewriting the mySummary function. It now automatically converts expressions that return a vector or named vector of length greater than one into df-cols. It also wraps expressions in list() automatically where necessary, for example when they return a test or model object.
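
The code in this answer rewrites each expression by pasting strings together and parsing them with str2lang. As an aside, the same kind of wrapping can also be done on the expression itself with rlang. This is only a sketch of that alternative, not what the answer's code does, and the names q and q_wrapped are illustrative:

```r
library(rlang)

# A quosure like the ones captured by enquos(...) inside mySummary
q <- quo(range(Value))

# Wrap the captured expression in list(...) without going through strings
q_wrapped <- quo_set_expr(q, expr(list(!!quo_get_expr(q))))

quo_get_expr(q_wrapped)   # the rewritten expression: list(range(Value))

# Evaluate against a small made-up data frame
res <- eval_tidy(q_wrapped, data = data.frame(Value = c(3, 1, 2)))
res
```

Working on the expression avoids a round trip through text, which can matter when the original call contains things that do not deparse cleanly.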

Then I defined a function, widen, which widens the data frame as much as possible while preserving rows, including calling broom::tidy on supported list-columns.
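
To see what the broom::tidy step contributes, here is a minimal sketch on made-up data (x and tt are illustrative names). For a one-sample t.test, tidy returns a one-row tibble, which is what lets widen keep one row per group:

```r
library(broom)

set.seed(1)
x <- rnorm(20, mean = 1.65, sd = 0.1)   # made-up sample

# tidy() turns the htest object into a one-row tibble of its key results
tt <- tidy(t.test(x))

names(tt)
nrow(tt)
```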

The approach is not perfect, and it could be extended by including unnest_wider in the widen function.

Note that I changed the grouping in the example so that I could use t.test as another example output.

library(tidyverse)
set.seed(123456)

toy <- tibble(
  Group=rep(1:3, each=5),
  Height=1.65 + rnorm(15, 0, 0.1),
  Weight= 75 + rnorm(15, 0, 10)
) %>% 
  pivot_longer(
    values_to="Value", 
    names_to="Parameter",
    cols=c(Height, Weight)
  )

# modified summary function
mySummary <- function(data, ...) {

  fns <- rlang::enquos(...)

  fns <- map(fns, function(x) {

    res <- rlang::eval_tidy(x, data = data)

    if ( ((is.vector(res)  || is.factor(res)) && length(res) == 1) ||
         ("list" %in% class(res) && is.list(res)) ||
           rlang::call_name(rlang::quo_get_expr(x)) == "list") {
      x
    }
    else if ((is.vector(res)  || is.factor(res)) && length(res) > 1) {
      x_expr <- as.character(list(rlang::quo_get_expr(x)))
      x_expr <- paste0(
        "pivot_wider(enframe(",
        x_expr,
        "), names_from = name, values_from = value)"
      )
      x <- rlang::quo_set_expr(x, str2lang(x_expr))

      x
    } else {
      x_expr <- as.character(list(rlang::quo_get_expr(x)))
      x_expr <- paste0("list(", x_expr,")")
      x <- rlang::quo_set_expr(x, str2lang(x_expr))

      x
    }
  })

  data %>% 
    group_by(Parameter) %>%
    summarise(!!! fns, .groups="drop")
}


# A function to automatically widen the df as much as possible while preserving rows
widen <- function(df) {

  df_cols <- names(df)[map_lgl(df, is.data.frame)]
  df <- unpack(df, all_of(df_cols), names_sep = "_")

  try_tidy <- function(x) {
    tryCatch({
      broom::tidy(x)
    }, error = function(e) {
      x
    })
  }

  df <- df %>% rowwise() %>% mutate(across(where(is.list), try_tidy))
  ungroup(df)
}

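# if you want to specify function arguments for convenience use purrr::partial
# NOTE: quantile() calls this argument `probs`, not `q`; as written, `q` is ignored
# and the default probs are used (hence 0% to 100% in the output below).
# Use probs = c(.25, .5, .75) to get only the three quartiles.
quantile3 <- partial(quantile, x = , q = c(.25, .5, .75))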

summary <- mySummary(toy,
                     Q = quantile3(Value),
                     R = range(Value),
                     T_test = t.test(Value),
                     Mean = mean(Value, na.rm=TRUE),
                     SD = sd(Value, na.rm=TRUE)
)

summary 
#> # A tibble: 2 x 6
#>   Parameter Q$`0%` $`25%` $`50%` $`75%` $`100%` R$`1`  $`2` T_test   Mean    SD
#>   <chr>      <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl> <list>  <dbl> <dbl>
#> 1 Height      1.54   1.62   1.73   1.77    1.90  1.54  1.90 <htest>  1.70 0.109
#> 2 Weight     67.5   72.9   76.9   83.2    91.7  67.5  91.7  <htest> 77.9  7.40

widen(summary)
#> # A tibble: 2 x 11
#>   Parameter `Q_0%` `Q_25%` `Q_50%` `Q_75%` `Q_100%`   R_1   R_2 T_test$estimate
#>   <chr>      <dbl>   <dbl>   <dbl>   <dbl>    <dbl> <dbl> <dbl>           <dbl>
#> 1 Height      1.54    1.62    1.73    1.77     1.90  1.54  1.90            1.70
#> 2 Weight     67.5    72.9    76.9    83.2     91.7  67.5  91.7            77.9 
#> # … with 9 more variables: $statistic <dbl>, $p.value <dbl>, $parameter <dbl>,
#> #   $conf.low <dbl>, $conf.high <dbl>, $method <chr>, $alternative <chr>,
#> #   Mean <dbl>, SD <dbl>
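
The unnest_wider extension mentioned earlier might look roughly like this for list-columns that hold named vectors. This is a sketch on hand-made data rather than part of the widen function above, and the names df_list, wide_r, min and max are illustrative:

```r
library(tidyr)
library(tibble)

# A list-column R holding one named numeric vector per row (made-up values)
df_list <- tibble(
  Parameter = c("Height", "Weight"),
  R = list(c(min = 1.54, max = 1.90), c(min = 67.5, max = 91.7))
)

# Each element of the vectors becomes its own column, prefixed with "R_"
wide_r <- unnest_wider(df_list, R, names_sep = "_")

names(wide_r)
```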

Created on 2020-06-14 by the reprex package (v0.3.0)
