等价于`table,contents()`Stata命令,用于摘要统计 [英] R equivalent of `table ,contents( )` Stata command for summary statistics
问题描述
我试图模仿R中的 table
Stata命令,该命令执行汇总统计信息表.该命令允许您在结果单元格内部创建具有各种统计信息的交叉表.例如,在下面的示例中,我交叉了三个变量( category1
, category2
和 category3
),并将均值作为列向量和 metric1
的标准偏差以及均值和标准偏差 metric2
的平均值.
I am trying to mimic the table
Stata command in R, which performs summary statistics tables. The command allows you to create cross tables with diverse statistics inside of the resulting cells. For instance, in my example below, I am crossing three variables (category1
, category2
, and category3
) and getting as a column vector the mean and standard deviation of metric1
and the mean and standard deviation metric2
.
陈述的行为是通过在Stata上的以下单行获得的.
The stated behavior is obtained with the following single line on Stata.
table category1 category2 category3 ,c(mean metric1 sd metric1 mean metric2 sd metric2)
所需的输出:表的说明.
这里是生成的交叉表的每个列向量,假设交叉表的 X
包含 X = [mean(metric1),sd(metric1),mean(metric2),sd(metric2)]'
----------------------------------------------------------------------------
| category3 and category2
| ------------ First ----------- ----------- Second -----------
category1 | A B C Total A B C Total
----------+-----------------------------------------------------------------
1 | mean(metric1)
| sd(metric1)
| mean(metric2)
| sd(metric1)
所需的输出(!):Stata上的结果表.
----------------------------------------------------------------------------
| category3 and category2
| ------------ First ----------- ----------- Second -----------
category1 | A B C Total A B C Total
----------+-----------------------------------------------------------------
1 | 5.778 7.200 2.571 5.048 6.667 3.000 3.000 4.222
| 2.906 3.347 2.507 3.324 2.309 1.414 1.155 2.333
| -1.556 -2.000 -1.143 -1.524 -2.000 -2.000 -3.000 -2.444
| 1.667 0.000 1.069 1.250 0.000 2.828 1.155 1.333
|
2 | 3.200 6.333 4.200 4.571 4.889 5.000 5.000 4.947
| 2.280 3.445 2.741 2.976 3.180 3.464 2.449 2.857
| -0.800 -2.000 -2.000 -1.714 -2.222 -1.500 -1.000 -1.684
| 1.095 1.265 1.333 1.309 1.563 1.000 1.673 1.529
|
3 | 8.667 4.667 5.167 5.667 5.667 6.667 6.000 6.000
| 2.309 2.309 2.758 2.849 3.445 4.163 3.464 3.303
| -3.333 -2.667 -2.000 -2.333 -2.333 -2.000 -1.333 -2.000
| 1.155 1.155 1.477 1.414 0.816 2.000 1.155 1.206
|
Total | 5.529 6.286 4.207 5.067 5.444 5.111 4.615 5.100
| 3.125 3.124 2.795 3.047 3.053 3.333 2.501 2.898
| -1.647 -2.143 -1.793 -1.833 -2.222 -1.778 -1.692 -1.950
| 1.618 0.949 1.346 1.342 1.166 1.563 1.601 1.395
----------------------------------------------------------------------------
生成上述结果的状态代码.
clear all
set obs 100
set seed 777
gen category1 = runiformint(1,3)
gen category2_num = runiformint(1,3)
gen category2 = "A" if category2_num ==1
replace category2 = "B" if category2_num ==2
replace category2 = "C" if category2_num ==3
drop category2_num
gen category3_num = runiformint(1,2)
gen category3 = "First" if category3_num ==1
replace category3 = "Second" if category3_num ==2
drop category3_num
gen metric1 = round(runiform()*10,2)
gen metric2 = round(runiform()*-4,2)
table category1 category2 category3 /// List of the variables that will create the crosstab
,c(mean metric1 sd metric1 /// Mean and std.dev of metric1 as 1st and 2nd rows
mean metric2 sd metric2) /// Mean and std.dev of metric2 as 3rd and 4th rows
row col /// Add the over all statistics total rows and cols
format(%9.3f) // Decimal style setting.
R尝试.
这是我解决问题的方式.但是,我仍然远远没有达到我想要的结果.即使我在屏幕上显示了相同的信息,但在R上显示的方式的可读性仍然很差.此外,我还没有计算行和列的总数的均值和标准差,Stata输出.
R attempt.
Here is how I have tackled the problem. However, I am still far from my desired results. Even though I have the same information displayed on the screen, the readability is very poor in the way I am presenting it on R. Additionally, I haven't computed the mean and standard deviation for the total of rows and columns and I did on the Stata output.
最后,在我看来,此过程似乎是解决此类简单问题的一种过分解决方案.在我的上下文中,允许使用软件包,因此,欢迎使用 dplyr
或 data.table
建议.
Finally, in my opinion, this procedure seems like an overkill solution for such a simple problem. In my context packages are allowed, hence, dplyr
or data.table
suggestions are welcome.
df <- as.data.frame(structure(list(category1 = structure(c(1, 3, 1, 2, 3, 1, 3, 1,1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 1,3, 3, 3, 2, 2, 1, 1, 1, 3, 2, 3, 1, 2, 2, 1, 3, 1, 3, 1, 1, 3,1, 1, 2, 1, 3, 2, 2, 3, 3, 3, 1, 2, 3, 2, 3, 2, 1, 1, 1, 2, 2,2, 1, 3, 2, 2, 2, 3, 3), format.stata = "%9.0g"),
category2 = structure(c("C", "A", "A", "A", "C", "C", "A", "A", "A", "A", "B", "A", "A", "A","A", "B", "A", "C", "C", "B", "C", "A", "A", "C", "A", "B", "C", "B", "C", "C", "A", "C", "B", "B", "A", "B", "C", "A", "B", "B","C", "A", "A", "C", "C", "B", "C", "A", "A", "C", "C", "B", "C", "C", "A", "C", "A", "A", "C", "B", "A", "C", "C", "C", "B", "B","C", "C", "A", "A", "C", "C", "A", "C", "B", "B", "C", "C", "C", "C", "A", "C", "C", "C", "C", "B", "B", "B", "B", "C", "A", "A","C", "C", "A", "A", "A", "B", "B", "C"), format.stata = "%9s"),
category3 = structure(c("First", "Second", "First", "First", "First", "First", "Second", "Second", "First", "Second", "First", "First", "Second", "Second", "First", "Second", "Second", "First", "Second", "First", "First", "First", "First","Second", "First", "First", "Second", "First", "First", "First","First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First","Second", "Second", "First", "Second", "Second", "Second","First", "First", "First", "Second", "Second", "First", "First","Second", "First", "First", "First", "First", "Second", "First","Second", "Second", "First", "Second", "First", "Second", "First", "Second", "First", "First", "First", "First", "Second","First", "First", "First", "Second", "Second", "First", "First","First", "Second", "First", "Second", "First", "Second","Second", "First", "Second", "First", "First", "Second","Second", "Second", "Second", "First"), format.stata = "%9s"),
metric1 = structure(c(0, 10, 0, 0, 8, 4, 4, 8, 8, 2, 4, 4, 6, 2, 6, 8, 6, 4, 4, 10, 10, 4, 6, 8, 6, 2, 4, 4, 6, 0, 6,0, 10, 8, 2, 2, 2, 0, 2, 10, 2, 8, 4, 6, 8, 2, 2, 6, 0, 2,4, 6, 2, 2, 8, 6, 8, 8, 2, 8, 10, 4, 4, 4, 4, 10, 4, 2, 6,4, 6, 4, 10, 2, 8, 6, 8, 2, 6, 6, 6, 4, 8, 6, 8, 2, 10, 2, 6, 2, 10, 4, 8, 0, 10, 6, 4, 2, 8, 8), format.stata = "%9.0g"),
metric2 = structure(c(0, -4, 0, 0, -2, -2, -2, -2, -4, -2, -2, -2, -2, -4, 0, 0, -2, -2, -4, -2, 0, -2, -4, -2, -2, -2, -2, -2, -4, 0, -4, -4, -2, -2, -2, -2, -2, -2, -4, -2, -2, -2, -2, -2, 0, -2, -4, -4, -2, -2, 0, -4, -2, 0, -2,-2, 0, -2, -4, 0, -2, -2, 0, 0, -4, -4, 0, -2, 0, -2, -2, -4, 0, -2, -2, -2, 0, -2, -2, -2, -2, -2, -2, 0, 0, 0, -2, 0, -2, -4, 0, 0, 0, -2, -4, -4, 0, -2, -2, -4), format.stata = "%9.0g")),
row.names = c(NA,-100L), class = c("tbl_df", "tbl", "data.frame")))
# expand grid for every possible value
prs <- expand.grid(cat1 = unique(df$category1) ,
cat2 = unique(df$category2) ,
cat3 = unique(df$category3))
#Number of total combinations
N <- nrow(prs)
#Loop over the combinations to get the desired statistis
A <- lapply(1:N, FUN = function(i){
mean1 <- mean(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
sd1 <- sd(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
mean2 <- mean(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
sd2 <- sd(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
r_list<- list(cat1 = prs$cat1[i],cat2 = prs$cat2[i], cat3 = prs$cat3[i],
mean1 = mean1, sd1 = sd1 , mean2 = mean2, sd2 = sd2)
return(r_list)
})
#List to data.frame
df_stats <- do.call(rbind.data.frame, A)
获得的输出(但是,不是我想要的输出(!))
Obtained output (but, not my desired output (!) )
# cat1 cat2 cat3 mean1 sd1 mean2 sd2
# 2 1 C First 2.571429 2.507133 -1.142857 1.0690450
# 21 3 C First 5.166667 2.757909 -2.000000 1.4770979
# 3 2 C First 4.200000 2.740641 -2.000000 1.3333333
# 4 1 A First 5.777778 2.905933 -1.555556 1.6666667
# 5 3 A First 8.666667 2.309401 -3.333333 1.1547005
# 6 2 A First 3.200000 2.280351 -0.800000 1.0954451
# 7 1 B First 7.200000 3.346640 -2.000000 0.0000000
# 8 3 B First 4.666667 2.309401 -2.666667 1.1547005
# 9 2 B First 6.333333 3.444803 -2.000000 1.2649111
# 10 1 C Second 3.000000 1.154701 -3.000000 1.1547005
# 11 3 C Second 6.000000 3.464102 -1.333333 1.1547005
# 12 2 C Second 5.000000 2.449490 -1.000000 1.6733201
# 13 1 A Second 6.666667 2.309401 -2.000000 0.0000000
# 14 3 A Second 5.666667 3.444803 -2.333333 0.8164966
# 15 2 A Second 4.888889 3.179797 -2.222222 1.5634719
# 16 1 B Second 3.000000 1.414214 -2.000000 2.8284271
# 17 3 B Second 6.666667 4.163332 -2.000000 2.0000000
# 18 2 B Second 5.000000 3.464102 -1.500000 1.0000000
推荐答案
您可以如下使用 data.table
和 magrittr
包:
library(magrittr)
library(data.table)
# function to compute the mean and sd
fun <- function(x, y) list(metric1_meam=mean(x), metric1_sd=sd(x), metric2_meam=mean(y), metric2_sd=sd(y))
# compute the Total column, and A,B,C columns of the desired output as follows and bind them
setDT(df)[, 'category1' := as.character(category1)]
Y <- rbind(
df[, fun(metric1, metric2), by=.(category1, category2, category3)],
df[, fun(metric1, metric2), by=.(category1, category3)][, category2 := 'Total'],
df[, fun(metric1, metric2), by=.(category2, category3)][, category1 := 'Total'],
df[, fun(metric1, metric2), by=.(category3)][, c('category1', 'category2') := 'Total']
)
# generate the desired output
melt(Y, measure=patterns('metric')) %>%
xtabs(formula = value ~ .) %>%
ftable(col.vars = c('category3', 'category2'))
category3 First Second
category2 A B C Total A B C Total
category1 variable
1 metric1_meam 5.7777778 7.2000000 2.5714286 5.0476190 6.6666667 3.0000000 3.0000000 4.2222222
metric1_sd 2.9059326 3.3466401 2.5071327 3.3237959 2.3094011 1.4142136 1.1547005 2.3333333
metric2_meam -1.5555556 -2.0000000 -1.1428571 -1.5238095 -2.0000000 -2.0000000 -3.0000000 -2.4444444
metric2_sd 1.6666667 0.0000000 1.0690450 1.2497619 0.0000000 2.8284271 1.1547005 1.3333333
2 metric1_meam 3.2000000 6.3333333 4.2000000 4.5714286 4.8888889 5.0000000 5.0000000 4.9473684
metric1_sd 2.2803509 3.4448028 2.7406406 2.9760952 3.1797973 3.4641016 2.4494897 2.8572264
metric2_meam -0.8000000 -2.0000000 -2.0000000 -1.7142857 -2.2222222 -1.5000000 -1.0000000 -1.6842105
metric2_sd 1.0954451 1.2649111 1.3333333 1.3093073 1.5634719 1.0000000 1.6733201 1.5294382
3 metric1_meam 8.6666667 4.6666667 5.1666667 5.6666667 5.6666667 6.6666667 6.0000000 6.0000000
metric1_sd 2.3094011 2.3094011 2.7579087 2.8491485 3.4448028 4.1633320 3.4641016 3.3028913
metric2_meam -3.3333333 -2.6666667 -2.0000000 -2.3333333 -2.3333333 -2.0000000 -1.3333333 -2.0000000
metric2_sd 1.1547005 1.1547005 1.4770979 1.4142136 0.8164966 2.0000000 1.1547005 1.2060454
Total metric1_meam 5.5294118 6.2857143 4.2068966 5.0666667 5.4444444 5.1111111 4.6153846 5.1000000
metric1_sd 3.1248529 3.1238185 2.7951400 3.0469027 3.0529103 3.3333333 2.5012817 2.8982753
metric2_meam -1.6470588 -2.1428571 -1.7931034 -1.8333333 -2.2222222 -1.7777778 -1.6923077 -1.9500000
metric2_sd 1.6179144 0.9492623 1.3464055 1.3424827 1.1659662 1.5634719 1.6012815 1.3950462
这篇关于等价于`table,contents()`Stata命令,用于摘要统计的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!