R equivalent of the `table, contents()` Stata command for summary statistics


Question

I am trying to mimic Stata's table command in R, which produces tables of summary statistics. The command lets you build cross-tabulations with several statistics inside each resulting cell. For instance, in my example below, I cross three variables (category1, category2, and category3) and get, as a column vector in each cell, the mean and standard deviation of metric1 and the mean and standard deviation of metric2.

This behavior is obtained with the following single line in Stata.

table category1 category2 category3 ,c(mean metric1 sd metric1 mean metric2 sd metric2) 

Desired output: explanation of the table.

Each cell of the resulting cross table holds a column vector X, where X = [mean(metric1), sd(metric1), mean(metric2), sd(metric2)]'

----------------------------------------------------------------------------
          |                     category3 and category2                     
          | ------------ First -----------    ----------- Second -----------
category1 |      A        B       C   Total         A       B       C   Total
----------+-----------------------------------------------------------------
        1 |  mean(metric1)  
          |  sd(metric1)  
          |  mean(metric2)  
          |  sd(metric2)   
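
As a rough R illustration of what one such vector contains, the sketch below computes X for a single cell and for a Total margin. The data frame `toy` and the helper `cell_vector` are hypothetical names introduced only for this example; they are not part of the data or code in this question. Note that a Total is computed from all raw observations in the margin, not by averaging the per-cell statistics.

```r
# Toy data (hypothetical, not the data set used in this question)
toy <- data.frame(
  category2 = c("A", "A", "A", "B"),
  metric1   = c(1, 2, 3, 10),
  metric2   = c(-2, 0, -4, -2)
)

# X = [mean(metric1), sd(metric1), mean(metric2), sd(metric2)]'
cell_vector <- function(d) {
  c(mean1 = mean(d$metric1), sd1 = sd(d$metric1),
    mean2 = mean(d$metric2), sd2 = sd(d$metric2))
}

x_a     <- cell_vector(toy[toy$category2 == "A", ])  # one cell (category2 == "A")
x_total <- cell_vector(toy)                           # Total margin: all rows

print(x_a)
print(x_total)
```

Here the mean of metric1 in the Total margin is 4, whereas averaging the two cell means (2 for A and 10 for B) would give 6, so margins need their own pass over the raw data.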

Desired output (!): the resulting table in Stata.


----------------------------------------------------------------------------
          |                     category3 and category2                     
          | ------------ First -----------    ----------- Second -----------
category1 |      A       B       C   Total         A       B       C   Total
----------+-----------------------------------------------------------------
        1 |  5.778   7.200   2.571   5.048     6.667   3.000   3.000   4.222
          |  2.906   3.347   2.507   3.324     2.309   1.414   1.155   2.333
          | -1.556  -2.000  -1.143  -1.524    -2.000  -2.000  -3.000  -2.444
          |  1.667   0.000   1.069   1.250     0.000   2.828   1.155   1.333
          | 
        2 |  3.200   6.333   4.200   4.571     4.889   5.000   5.000   4.947
          |  2.280   3.445   2.741   2.976     3.180   3.464   2.449   2.857
          | -0.800  -2.000  -2.000  -1.714    -2.222  -1.500  -1.000  -1.684
          |  1.095   1.265   1.333   1.309     1.563   1.000   1.673   1.529
          | 
        3 |  8.667   4.667   5.167   5.667     5.667   6.667   6.000   6.000
          |  2.309   2.309   2.758   2.849     3.445   4.163   3.464   3.303
          | -3.333  -2.667  -2.000  -2.333    -2.333  -2.000  -1.333  -2.000
          |  1.155   1.155   1.477   1.414     0.816   2.000   1.155   1.206
          | 
    Total |  5.529   6.286   4.207   5.067     5.444   5.111   4.615   5.100
          |  3.125   3.124   2.795   3.047     3.053   3.333   2.501   2.898
          | -1.647  -2.143  -1.793  -1.833    -2.222  -1.778  -1.692  -1.950
          |  1.618   0.949   1.346   1.342     1.166   1.563   1.601   1.395
----------------------------------------------------------------------------

Stata code that generates the results above.

clear all
set obs 100

set seed 777
gen category1 = runiformint(1,3)
gen category2_num = runiformint(1,3)

gen category2 = "A" if category2_num ==1
replace category2 = "B" if category2_num ==2
replace category2 = "C" if category2_num ==3

drop category2_num

gen category3_num = runiformint(1,2)
gen category3 = "First" if category3_num ==1
replace category3 = "Second" if category3_num ==2

drop category3_num

gen metric1 = round(runiform()*10,2)
gen metric2 = round(runiform()*-4,2)

table category1 category2 category3 /// List of the variables that will create the crosstab
         ,c(mean metric1 sd metric1 /// Mean and std.dev of metric1 as 1st and 2nd rows
         mean metric2 sd metric2)   /// Mean and std.dev of metric2 as 3rd and 4th rows
         row col                    /// Add the overall statistics as Total rows and cols
         format(%9.3f)              // Decimal style setting.

R attempt.

Here is how I have tackled the problem. However, I am still far from the desired result. Although the same information is displayed on screen, the way I present it in R is hard to read. Additionally, I have not computed the mean and standard deviation for the row and column totals, as I did in the Stata output.

Finally, this procedure seems like overkill for such a simple problem. Packages are allowed in my context, so dplyr or data.table suggestions are welcome.

df <- as.data.frame(structure(list(category1 = structure(c(1, 3, 1, 2, 3, 1, 3, 1,1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 1,3, 3, 3, 2, 2, 1, 1, 1, 3, 2, 3, 1, 2, 2, 1, 3, 1, 3, 1, 1, 3,1, 1, 2, 1, 3, 2, 2, 3, 3, 3, 1, 2, 3, 2, 3, 2, 1, 1, 1, 2, 2,2, 1, 3, 2, 2, 2, 3, 3), format.stata = "%9.0g"), 
                    category2 = structure(c("C", "A", "A", "A", "C", "C", "A", "A", "A", "A", "B", "A", "A", "A","A", "B", "A", "C", "C", "B", "C", "A", "A", "C", "A", "B", "C", "B", "C", "C", "A", "C", "B", "B", "A", "B", "C", "A", "B", "B","C", "A", "A", "C", "C", "B", "C", "A", "A", "C", "C", "B", "C", "C", "A", "C", "A", "A", "C", "B", "A", "C", "C", "C", "B", "B","C", "C", "A", "A", "C", "C", "A", "C", "B", "B", "C", "C", "C", "C", "A", "C", "C", "C", "C", "B", "B", "B", "B", "C", "A", "A","C", "C", "A", "A", "A", "B", "B", "C"), format.stata = "%9s"), 
                    category3 = structure(c("First", "Second", "First", "First", "First", "First", "Second", "Second", "First", "Second", "First", "First", "Second", "Second", "First", "Second", "Second", "First", "Second", "First", "First", "First", "First","Second", "First", "First", "Second", "First", "First", "First","First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First","Second", "Second", "First", "Second", "Second", "Second","First", "First", "First", "Second", "Second", "First", "First","Second", "First", "First", "First", "First", "Second", "First","Second", "Second", "First", "Second", "First", "Second", "First", "Second", "First", "First", "First", "First", "Second","First", "First", "First", "Second", "Second", "First", "First","First", "Second", "First", "Second", "First", "Second","Second", "First", "Second", "First", "First", "Second","Second", "Second", "Second", "First"), format.stata = "%9s"),
                    metric1 = structure(c(0, 10, 0, 0, 8, 4, 4, 8, 8, 2, 4, 4, 6, 2, 6, 8, 6, 4, 4, 10, 10, 4, 6, 8, 6, 2, 4, 4, 6, 0, 6,0, 10, 8, 2, 2, 2, 0, 2, 10, 2, 8, 4, 6, 8, 2, 2, 6, 0, 2,4, 6, 2, 2, 8, 6, 8, 8, 2, 8, 10, 4, 4, 4, 4, 10, 4, 2, 6,4, 6, 4, 10, 2, 8, 6, 8, 2, 6, 6, 6, 4, 8, 6, 8, 2, 10, 2, 6, 2, 10, 4, 8, 0, 10, 6, 4, 2, 8, 8), format.stata = "%9.0g"),
                    metric2 = structure(c(0, -4, 0, 0, -2, -2, -2, -2, -4, -2, -2, -2, -2, -4, 0, 0, -2, -2, -4, -2, 0, -2, -4, -2, -2, -2, -2, -2, -4, 0, -4, -4, -2, -2, -2, -2, -2, -2, -4, -2, -2, -2, -2, -2, 0, -2, -4, -4, -2, -2, 0, -4, -2, 0, -2,-2, 0, -2, -4, 0, -2, -2, 0, 0, -4, -4, 0, -2, 0, -2, -2, -4, 0, -2, -2, -2, 0, -2, -2, -2, -2, -2, -2, 0, 0, 0, -2, 0, -2, -4, 0, 0, 0, -2, -4, -4, 0, -2, -2, -4), format.stata = "%9.0g")), 
               row.names = c(NA,-100L), class = c("tbl_df", "tbl", "data.frame")))

# expand grid for every possible value
prs <- expand.grid(cat1 = unique(df$category1)   ,
                   cat2 = unique(df$category2) ,
                   cat3 = unique(df$category3))

#Number of total combinations 
N <-   nrow(prs)
#Loop over the combinations to get the desired statistics
A <- lapply(1:N, FUN = function(i){
      mean1 <- mean(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
      sd1   <- sd(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
        
      mean2 <- mean(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
      sd2   <- sd(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
        
      r_list<- list(cat1 = prs$cat1[i],cat2 = prs$cat2[i], cat3 = prs$cat3[i],
                    mean1 = mean1,  sd1 = sd1 , mean2 = mean2, sd2 = sd2)
  return(r_list)
})

#List to data.frame
df_stats <- do.call(rbind.data.frame, A)

Obtained output (but not my desired output (!))

# cat1 cat2   cat3    mean1      sd1     mean2       sd2
# 2     1    C  First 2.571429 2.507133 -1.142857 1.0690450
# 21    3    C  First 5.166667 2.757909 -2.000000 1.4770979
# 3     2    C  First 4.200000 2.740641 -2.000000 1.3333333
# 4     1    A  First 5.777778 2.905933 -1.555556 1.6666667
# 5     3    A  First 8.666667 2.309401 -3.333333 1.1547005
# 6     2    A  First 3.200000 2.280351 -0.800000 1.0954451
# 7     1    B  First 7.200000 3.346640 -2.000000 0.0000000
# 8     3    B  First 4.666667 2.309401 -2.666667 1.1547005
# 9     2    B  First 6.333333 3.444803 -2.000000 1.2649111
# 10    1    C Second 3.000000 1.154701 -3.000000 1.1547005
# 11    3    C Second 6.000000 3.464102 -1.333333 1.1547005
# 12    2    C Second 5.000000 2.449490 -1.000000 1.6733201
# 13    1    A Second 6.666667 2.309401 -2.000000 0.0000000
# 14    3    A Second 5.666667 3.444803 -2.333333 0.8164966
# 15    2    A Second 4.888889 3.179797 -2.222222 1.5634719
# 16    1    B Second 3.000000 1.414214 -2.000000 2.8284271
# 17    3    B Second 6.666667 4.163332 -2.000000 2.0000000
# 18    2    B Second 5.000000 3.464102 -1.500000 1.0000000

Recommended answer

You can use the data.table and magrittr packages as follows:

library(magrittr)
library(data.table)

# helper that returns the mean and sd of both metrics as a named list
fun <- function(x, y) list(metric1_mean=mean(x), metric1_sd=sd(x), metric2_mean=mean(y), metric2_sd=sd(y))

# category1 must be character so that it can also hold the 'Total' label
setDT(df)[, 'category1' := as.character(category1)]

# compute the A, B, C cells plus the Total rows and columns, then bind them;
# use.names = TRUE matches columns by name, since their order differs between the pieces
Y <- rbind(
  df[, fun(metric1, metric2), by=.(category1, category2, category3)],
  df[, fun(metric1, metric2), by=.(category1, category3)][, category2 := 'Total'],
  df[, fun(metric1, metric2), by=.(category2, category3)][, category1 := 'Total'],
  df[, fun(metric1, metric2), by=.(category3)][, c('category1', 'category2') := 'Total'],
  use.names = TRUE
)

# generate the desired output
melt(Y, measure=patterns('metric')) %>% 
  xtabs(formula = value ~ .) %>% 
  ftable(col.vars = c('category3', 'category2'))

Output:

                       category3      First                                      Second                                 
                       category2          A          B          C      Total          A          B          C      Total
category1 variable                                                                                                      
1         metric1_mean            5.7777778  7.2000000  2.5714286  5.0476190  6.6666667  3.0000000  3.0000000  4.2222222
          metric1_sd              2.9059326  3.3466401  2.5071327  3.3237959  2.3094011  1.4142136  1.1547005  2.3333333
          metric2_mean           -1.5555556 -2.0000000 -1.1428571 -1.5238095 -2.0000000 -2.0000000 -3.0000000 -2.4444444
          metric2_sd              1.6666667  0.0000000  1.0690450  1.2497619  0.0000000  2.8284271  1.1547005  1.3333333
2         metric1_mean            3.2000000  6.3333333  4.2000000  4.5714286  4.8888889  5.0000000  5.0000000  4.9473684
          metric1_sd              2.2803509  3.4448028  2.7406406  2.9760952  3.1797973  3.4641016  2.4494897  2.8572264
          metric2_mean           -0.8000000 -2.0000000 -2.0000000 -1.7142857 -2.2222222 -1.5000000 -1.0000000 -1.6842105
          metric2_sd              1.0954451  1.2649111  1.3333333  1.3093073  1.5634719  1.0000000  1.6733201  1.5294382
3         metric1_mean            8.6666667  4.6666667  5.1666667  5.6666667  5.6666667  6.6666667  6.0000000  6.0000000
          metric1_sd              2.3094011  2.3094011  2.7579087  2.8491485  3.4448028  4.1633320  3.4641016  3.3028913
          metric2_mean           -3.3333333 -2.6666667 -2.0000000 -2.3333333 -2.3333333 -2.0000000 -1.3333333 -2.0000000
          metric2_sd              1.1547005  1.1547005  1.4770979  1.4142136  0.8164966  2.0000000  1.1547005  1.2060454
Total     metric1_mean            5.5294118  6.2857143  4.2068966  5.0666667  5.4444444  5.1111111  4.6153846  5.1000000
          metric1_sd              3.1248529  3.1238185  2.7951400  3.0469027  3.0529103  3.3333333  2.5012817  2.8982753
          metric2_mean           -1.6470588 -2.1428571 -1.7931034 -1.8333333 -2.2222222 -1.7777778 -1.6923077 -1.9500000
          metric2_sd              1.6179144  0.9492623  1.3464055  1.3424827  1.1659662  1.5634719  1.6012815  1.3950462
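
The table above prints seven decimals, while the Stata call used format(%9.3f). One possible way to get closer to that display is to round the values before tabulating. The sketch below shows the idea in base R on a small hypothetical data frame (`stats` and its values are made up for illustration and only mimic the long format produced by melt()).

```r
# Toy long-format statistics (hypothetical names and values)
stats <- data.frame(
  category1 = c("1", "1", "2", "2"),
  category2 = c("A", "B", "A", "B"),
  value     = c(5.7777778, 7.2, 3.2, 6.3333333)
)

# Round before tabulating. Each combination occurs exactly once,
# so the sum that xtabs() computes is just the rounded value itself.
stats$value <- round(stats$value, 3)
tab <- xtabs(value ~ category1 + category2, data = stats)

print(ftable(tab))
```

In the data.table pipeline above, the same effect could presumably be had by rounding the `value` column right after melt(), for example with `[, value := round(value, 3)]`, before passing the result to xtabs().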
