Comparing Groups in data.table Columns

Question

I have a dataset that I need to split by one variable (Day) and then compare between the groups of another variable (Group), computing per-group statistics (e.g. the mean) and also running tests.

My contrived example:

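library(data.table)
# 10 days x 10 observations; Group alternates between 1 and 2 within each day
data = data.table(Day = rep(1:10, each = 10), 
                  Group = rep(1:2, times = 50),
                  V = rnorm(100))

# Per day: the mean of V in each group, plus a two-sample t-test p-value
data[, .(g1_mean = mean(.SD[Group == 1]$V),
         g2_mean = mean(.SD[Group == 2]$V),
         p.value = t.test(V ~ Group, .SD, alternative = "two.sided")$p.value),
     by = list(Day)]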

Which produces:

    Day      g1_mean     g2_mean   p.value
 1:   1  0.883406048  0.67177271 0.6674138
 2:   2  0.007544956 -0.55609722 0.3948459
 3:   3  0.409248637  0.28717183 0.8753213
 4:   4 -0.540075365  0.23181458 0.1785854
 5:   5 -0.632543900 -1.09965990 0.6457325
 6:   6 -0.083221671 -0.96286343 0.2011136
 7:   7 -0.044674252 -0.27666473 0.7079499
 8:   8  0.260795244 -0.15159164 0.4663712
 9:   9 -0.134164758  0.01136245 0.7992453
10:  10  0.496144329  0.76168408 0.1821123



I'm hoping that there's a less roundabout manner of arriving at this result.

Recommended answer

A possible compact alternative which can also apply more functions to each group:

DTnew <- dcast(DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day],
               Day + pval ~ paste0("g",Group), fun = list(mean,sd), value.var = "V")

Which gives:

> DTnew
    Day      pval   V_mean_g1    V_mean_g2   V_sd_g1   V_sd_g2
 1:   1 0.4763594 -0.11630634  0.178240714 0.7462975 0.4516087
 2:   2 0.5715001 -0.29689807  0.082970631 1.3614177 0.2745783
 3:   3 0.2295251 -0.48792449 -0.031328749 0.3723247 0.6703694
 4:   4 0.5565573  0.33982242  0.080169698 0.5635136 0.7560959
 5:   5 0.5498684 -0.07554433  0.308661427 0.9343230 1.0100788
 6:   6 0.4814518  0.57694034  0.885968245 0.6457926 0.6773873
 7:   7 0.8053066  0.29845913  0.116217727 0.9541060 1.2782210
 8:   8 0.3549573  0.14827289 -0.319017581 0.5328734 0.9036501
 9:   9 0.7290625 -0.21589411 -0.005785092 0.9639758 0.8859461
10:  10 0.9899833  0.84034529  0.850429982 0.6645952 1.5809149

Explanation:

  • First, a pval variable is added to the dataset with DT[, pval := t.test(V ~ Group, .SD, alternative = "two.sided")$p.value, Day].
  • Because that step updates DT in place, by reference, the dcast function can be applied directly to its result.
  • In the casting formula, the variables that need to stay in their current form go on the LHS (here Day and pval), and the variable that needs to be spread over columns goes on the RHS (here paste0("g", Group)).
  • With the fun.aggregate argument (which fun partially matches) you specify which aggregation function is applied to the value.var (here V). If multiple aggregation functions are needed, you can pass them in a list (e.g. list(mean, sd)). Any function that returns a single value per group can be used, including custom functions.
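
The steps above can also be run one at a time. The following is a minimal sketch, assuming the same simulated data as in the question; the seed value and the names wide and direct are illustrative and not part of the original answer.

```r
library(data.table)

set.seed(42)  # arbitrary seed, added only to make the sketch reproducible
DT <- data.table(Day   = rep(1:10, each = 10),
                 Group = rep(1:2, times = 50),
                 V     = rnorm(100))

# Step 1: add the per-Day p-value by reference; DT itself gains a pval column
DT[, pval := t.test(V ~ Group, data = .SD, alternative = "two.sided")$p.value,
   by = Day]

# Steps 2-4: cast to wide form. Day and pval stay as row identifiers (LHS),
# Group is spread over columns (RHS), and V is aggregated with mean and sd.
wide <- dcast(DT, Day + pval ~ paste0("g", Group),
              fun.aggregate = list(mean, sd), value.var = "V")

# Sanity check: the stored p-value for Day 1 equals a direct t-test on that day
direct <- t.test(V ~ Group, data = DT[Day == 1])$p.value
all.equal(wide[Day == 1, pval], direct)
```

Because pval is constant within each Day, keeping it on the LHS of the formula does not create extra rows; it simply travels along with Day into the wide table.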

If you want to remove the V_ from the column names, you can do:

setnames(DTnew, sub("^V_", "", names(DTnew)))  # renames by reference; only a leading "V_" is removed






NOTE: I renamed the data.table to DT as it is often not wise to name your dataset after a function (check ?data)
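
For comparison, the per-group means from the question can also be computed by subsetting V directly inside j, without indexing .SD once per group. This is a hedged sketch rather than part of the original answer; the seed and the name res are illustrative.

```r
library(data.table)

set.seed(42)  # arbitrary seed, added only to make the sketch reproducible
DT <- data.table(Day   = rep(1:10, each = 10),
                 Group = rep(1:2, times = 50),
                 V     = rnorm(100))

# Within each Day, V and Group are plain vectors, so V[Group == 1] picks
# out the values of group 1 without going through .SD.
res <- DT[, .(g1_mean = mean(V[Group == 1]),
              g2_mean = mean(V[Group == 2]),
              p.value = t.test(V ~ Group, data = .SD)$p.value),
          by = Day]
res
```

The result has the same shape as the table in the question. The dcast approach remains more convenient when several summary functions are needed for every group.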
