Creating table of variables with specific summary statistics


Question

I am trying to make a table of all my numerical variables (i.e. features) in the following format:

| Feature | Count | % Missing | Cardinality | Min. | 1st Quartile | Mean | Median | 3rd Quartile | Max. | Std. Dev. |
| ------- | ----- | --------- | ----------- | ---- | ------------ | ---- | ------ | ------------ | ---- | --------- |
|         |       |           |             |      |              |      |        |              |      |           |

So each row represents a specific numeric variable, and each column one of the statistics shown above (Count, % Missing, Cardinality, Min., 1st Quartile, Mean, Median, 3rd Quartile, Max., Std. Dev.).

Say my dataset is called Mashable and my numerical variables are called X, Y and Z. How would I create this table?

Thanks in advance!

Recommended answer

If you're using dplyr already, you can make use of long shaped data and grouping, and treat all the functions you need as summarizations. That lets you scale easily, so it's the same workflow for 3 variables as it is for 25 or 100. It also makes it relatively quick to apply whatever functions you want.
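As a minimal sketch of what "long shaped" means here: every numeric column is stacked into one name column and one value column, so a single group_by() covers all variables at once. The data frame `toy` and its values are made up for illustration, and `pivot_longer()` is assumed to be available (tidyr 1.0.0 or later); it is the newer counterpart of `gather()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not from the question
toy <- tibble(x = c(1, 2, NA), y = c(10, 20, 30))

# Stack all columns into (variable, value) pairs: one row per original cell
long <- toy %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value")

# One group per original column, however many columns there are
per_variable <- long %>%
  group_by(variable) %>%
  summarise(count = n(), missing = sum(is.na(value)))

print(long)
print(per_variable)
```

Adding more columns to `toy` would not change the pipeline, which is the scaling property described above.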

I made dummy data with x, y, and z, then bound a couple of rows of NAs onto it just to show the missing value count. Gather it to long data, group by the variable, then use whatever summary functions you want. I started with the first several you named. This gives you the format you asked for.

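library(tidyverse)

# Dummy data: three numeric variables, 100 rows each
tibble(
  x = rnorm(100, mean = 1, sd = 1),
  y = rnorm(100, mean = 10, sd = 1),
  z = rexp(100, rate = 0.01)
) %>%
  # Append two all-NA rows so the missing-value columns are non-zero
  bind_rows(tibble(x = c(NA, NA), y = c(NA, NA), z = c(NA, NA))) %>%
  # Reshape to long format: one row per (variable, value) pair
  gather(key = variable, value = value) %>%
  group_by(variable) %>%
  summarise(
    count = n(),                          # rows per variable, NAs included
    missing = sum(is.na(value)),
    share_missing = missing / count,      # a proportion; multiply by 100 for a percentage
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    q1 = quantile(value, 0.25, na.rm = TRUE)
  )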
#> # A tibble: 3 x 7
#>   variable count missing share_missing    mean     sd     q1
#>   <chr>    <int>   <int>         <dbl>   <dbl>  <dbl>  <dbl>
#> 1 x          102       2        0.0196   0.997  1.08   0.246
#> 2 y          102       2        0.0196   9.81   0.962  9.10 
#> 3 z          102       2        0.0196 106.    90.6   39.9

Created on 2018-05-20 by the reprex package (v0.2.0).
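
The answer stops after the first few statistics. The sketch below extends the same approach to every column requested in the question. It is an illustration under stated assumptions, not the answerer's own code: the `Mashable` data frame is a made-up stand-in with columns X, Y, Z plus one non-numeric column, `pivot_longer()` and `select(where(...))` assume tidyr 1.0.0 and dplyr 1.0.0 or later, "Count" is taken to include rows with missing values (as in the answer above), and "Cardinality" is taken to mean the number of distinct non-missing values.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the asker's data; replace with the real Mashable
set.seed(1)
Mashable <- tibble(
  X = c(rnorm(98, mean = 1), NA, NA),
  Y = c(rnorm(99, mean = 10), NA),
  Z = rexp(100, rate = 0.01),
  label = rep(c("a", "b"), 50)   # non-numeric column, dropped below
)

feature_table <- Mashable %>%
  select(where(is.numeric)) %>%   # keep numeric features only
  pivot_longer(everything(), names_to = "Feature", values_to = "value") %>%
  group_by(Feature) %>%
  summarise(
    count       = n(),                               # all rows, NAs included
    pct_missing = 100 * mean(is.na(value)),
    cardinality = n_distinct(value, na.rm = TRUE),   # distinct non-missing values
    minimum     = min(value, na.rm = TRUE),
    q1          = quantile(value, 0.25, na.rm = TRUE, names = FALSE),
    mean        = mean(value, na.rm = TRUE),
    median      = median(value, na.rm = TRUE),
    q3          = quantile(value, 0.75, na.rm = TRUE, names = FALSE),
    maximum     = max(value, na.rm = TRUE),
    sd          = sd(value, na.rm = TRUE)
  )

print(feature_table)
```

Each row of `feature_table` is one numeric variable and each column one statistic, matching the layout in the question. If "Count" should exclude missing values, `sum(!is.na(value))` could be used instead of `n()`.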
