Summary Statistics table with factors and continuous variables


Problem description


I am trying to create a simple summary statistics table (min, max, mean, n, etc.) that handles both factor variables and continuous variables, even when there is more than one factor variable. I would like good-looking HTML output, e.g. stargazer or huxtable output.

For a simple reproducible example, I'll use mtcars but change two of the variables to factors, and simplify to three variables.

library(tidyverse)
library(stargazer)

mtcars_df <- mtcars
mtcars_df <- mtcars_df %>% 
  mutate(vs = factor(vs),
         am = factor(am)) %>% 
  select(mpg, vs, am)
head(mtcars_df)

So the data has two factor variables, vs and am. mpg is left as a double:

#>     mpg     vs     am
#>   <dbl> <fctr> <fctr>
#> 1  21.0      0      1
#> 2  21.0      0      1
#> 3  22.8      1      1
#> 4  21.4      1      0
#> 5  18.7      0      0
#> 6  18.1      1      0

My desired output would look something like this (format only, the numbers aren't all correct for am0):

======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
vs0       32 0.562   0.504    0     0        1      1 
vs1       32 0.438   0.504    0     0        1      1 
am0       32 0.594   0.499    0     0        1      1 
am1       32 0.406   0.499    0     0        1      1 
------------------------------------------------------

A direct call to stargazer drops the factor columns (a workaround for summarising a single factor follows below):

# this doesn't give factors
stargazer(mtcars_df, type = "text")

======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
------------------------------------------------------

This previous answer from @jake-fisher works very well to summarise one factor variable. https://stackoverflow.com/a/26935270/8742237

The code below, taken from that answer, gives both levels of the first factor vs (vs0 and vs1), but for the second factor, am, it lists summary statistics for only one level: am0 is missing.

I do realise that this is because we want to avoid the dummy variable trap when modeling, but my issue is not about modeling, it's about creating a summary table with all values of all factor variables.

options(na.action = "na.pass")  # so that we keep missing values in the data
X <- model.matrix(~ . - 1, data = mtcars_df)
X.df <- data.frame(X)  # stargazer only does summary tables of data.frame objects
#names(X) <- colnames(X)
stargazer(X.df, type = "text")


======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
vs0       32 0.562   0.504    0     0        1      1 
vs1       32 0.438   0.504    0     0        1      1 
am1       32 0.406   0.499    0     0        1      1 
------------------------------------------------------
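
To illustrate the dummy variable trap mentioned above, here is a minimal base R sketch (the names df and D are illustrative, not from the original post). It builds one 0/1 indicator column per level by hand and checks that the columns are linearly dependent, which is the reason modelling functions drop a base level:

```r
# Rebuild the example data with base R only (no tidyverse needed here).
df <- transform(mtcars[, c("mpg", "vs", "am")],
                vs = factor(vs),
                am = factor(am))

# One 0/1 indicator column per level of each factor (hypothetical helper matrix).
D <- cbind(vs0 = as.integer(df$vs == "0"),
           vs1 = as.integer(df$vs == "1"),
           am0 = as.integer(df$am == "0"),
           am1 = as.integer(df$am == "1"))

# Within each factor the indicators sum to 1 in every row ...
all(rowSums(D[, c("vs0", "vs1")]) == 1)
all(rowSums(D[, c("am0", "am1")]) == 1)

# ... so vs0 + vs1 - am0 - am1 is zero in every row, and the matrix has
# fewer independent columns than it has columns.
qr(D)$rank < ncol(D)
```

That redundancy is a problem for regression but harmless for a descriptive table, where every level should be listed.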

While use of stargazer or huxtable would be preferred, if there's an easier way to produce this sort of summary table with a different library, that would still be very helpful.

Solution

In the end, rather than using model.matrix(), which by design drops the base level when creating dummy variables, a simple fix is to use mlr::createDummyFeatures(), which creates a dummy column for every level, including the base level.

library(tidyverse)
library(stargazer)
library(mlr)

mtcars_df <- mtcars
mtcars_df <- mtcars_df %>% 
  mutate(vs = factor(vs),
         am = factor(am)) %>% 
  select(mpg, vs, am)
head(mtcars_df)


X <- mlr::createDummyFeatures(obj = mtcars_df)
X.df <- data.frame(X)  # stargazer only does summary tables of data.frame objects
#names(X) <- colnames(X)
stargazer(X.df, type = "text")

This gives the desired output:

======================================================
Statistic N   Mean  St. Dev. Min Pctl(25) Pctl(75) Max
------------------------------------------------------
mpg       32 20.091  6.027   10    15.4     22.8   34 
vs.0      32 0.562   0.504    0     0        1      1 
vs.1      32 0.438   0.504    0     0        1      1 
am.0      32 0.594   0.499    0     0        1      1 
am.1      32 0.406   0.499    0     0        1      1 
------------------------------------------------------
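
If adding the mlr dependency is not desirable, a possible alternative is to stay with model.matrix() and pass it full (identity) contrasts for every factor through its contrasts.arg argument. This is a sketch under that assumption rather than the approach used above; the names df, is_fac and full_contrasts are illustrative:

```r
# Rebuild the example data with base R only.
df <- transform(mtcars[, c("mpg", "vs", "am")],
                vs = factor(vs),
                am = factor(am))

# For each factor column, ask for the full indicator coding
# (contrasts = FALSE returns an identity matrix with one column per level).
is_fac <- vapply(df, is.factor, logical(1))
full_contrasts <- lapply(df[is_fac], contrasts, contrasts = FALSE)

# Same formula as before, but no level of any factor is dropped.
X <- model.matrix(~ . - 1, data = df, contrasts.arg = full_contrasts)
X.df <- data.frame(X)

colnames(X.df)            # one column per factor level, plus mpg
round(colMeans(X.df), 3)  # share of rows in each level
```

The columns should come out as mpg, vs0, vs1, am0 and am1 (without the dots that createDummyFeatures() inserts), and X.df can then be passed to stargazer(X.df, type = "text"), or type = "html" for HTML output, exactly as in the code above.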
