如何计算data.table中十进制组的统计信息 [英] How can I compute statistics by decile groups in data.table

查看:136
本文介绍了如何计算data.table中十进制组的统计信息的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个data.table,并希望按组计算stats。

I have a data.table and would like to compute stats by groups.

R) set.seed(1)
R) DT=data.table(a=rnorm(100),b=rnorm(100))

这些组应该由

R) quantile(DT$a,probs=seq(.1,.9,.1))
           10%            20%            30%            40%            50%            60%            70%            80%            90% 
-1.05265747329 -0.61386923071 -0.37534201964 -0.07670312896  0.11390916079  0.37707993057  0.58121734252  0.77125359976  1.18106507751 

我如何计算每个bin的 b 如果 b = - 。5 我在 [ - 0.61386923071,-0.37534201964] c> 3

How can I compute say the average of b per bin, say if b=-.5 I am within [-0.61386923071,-0.37534201964] so in bin 3

推荐答案

如何:

> DT[, mean(b), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
                cut          V1
1:               NA -0.31359818
2:   (-1.05,-0.614] -0.14103182
3:  (-0.614,-0.375] -0.33474492
4: (-0.375,-0.0767]  0.20827735
5:  (-0.0767,0.114]  0.14890251
6:    (0.114,0.377]  0.16685304
7:    (0.377,0.581]  0.07086979
8:    (0.581,0.771]  0.17950572
9:     (0.771,1.18] -0.04951607

要查看该NA(并检查结果) I next did:

To have a look at that NA (and to check the results anyway), I next did :

> DT[, list(mean(b),.N,list(a)), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
                cut          V1  N                                                                                                                      V3
1:               NA -0.31359818 20                1.59528080213779,1.51178116845085,-2.2146998871775,-1.98935169586337,-1.47075238389927,1.35867955152904,
2:   (-1.05,-0.614] -0.14103182 10        -0.626453810742332,-0.835628612410047,-0.820468384118015,-0.621240580541804,-0.68875569454952,-0.70749515696212,
3:  (-0.614,-0.375] -0.33474492 10        -0.47815005510862,-0.41499456329968,-0.394289953710349,-0.612026393250771,-0.443291873218433,-0.589520946188072,
4: (-0.375,-0.0767]  0.20827735 10      -0.305388387156356,-0.155795506705329,-0.102787727342996,-0.164523596253587,-0.253361680136508,-0.112346212150228,
5:  (-0.0767,0.114]  0.14890251 10 -0.0449336090152309,-0.0161902630989461,0.0745649833651906,-0.0561287395290008,-0.0538050405829051,-0.0593133967111857,
6:    (0.114,0.377]  0.16685304 10             0.183643324222082,0.329507771815361,0.36458196213683,0.341119691424425,0.188792299514343,0.153253338211898,
7:    (0.377,0.581]  0.07086979 10            0.487429052428485,0.575781351653492,0.389843236411431,0.417941560199702,0.387671611559369,0.556663198673657,
8:    (0.581,0.771]  0.17950572 10             0.738324705129217,0.593901321217509,0.61982574789471,0.763175748457544,0.696963375404737,0.768532924515416,
9:     (0.771,1.18] -0.04951607 10              1.12493091814311,0.943836210685299,0.821221195098089,0.918977371608218,0.782136300731067,1.10002537198388,

Aside:我已经返回了一个列表有快速看看进入箱子的值,只是检查。 data.table 在打印时显示逗号(仅显示每个单元格的前6个项目),但 V3 实际上有一个数字向量。

Aside: I've returned a list column (each cell is itself a vector) there to have a quick look at the values going into the bins, just to check. data.table displays commas when printing (and shows just the first 6 items per cell), but each cell of V3 there is actually a numeric vector.

因此,第一个和最后一个 break 之外的值被一起编码为NA 。我不明白如何告诉 cut 不要这样做。所以我只是添加了-Inf和+ Inf:

So the values outside the first and last break are being coded together as NA. It's not obvious to me how to tell cut not to do that. So I just added -Inf and +Inf :

> DT[,list(mean(b),.N),keyby=cut(a,c(-Inf,quantile(a,probs=seq(.1,.9,.1)),+Inf))]
                 cut          V1  N
 1:     (-Inf,-1.05] -0.16938368 10
 2:   (-1.05,-0.614] -0.14103182 10
 3:  (-0.614,-0.375] -0.33474492 10
 4: (-0.375,-0.0767]  0.20827735 10
 5:  (-0.0767,0.114]  0.14890251 10
 6:    (0.114,0.377]  0.16685304 10
 7:    (0.377,0.581]  0.07086979 10
 8:    (0.581,0.771]  0.17950572 10
 9:     (0.771,1.18] -0.04951607 10
10:      (1.18, Inf] -0.45781268 10

这更好,或者:

> DT[, list(mean(b),.N), keyby=cut(a,quantile(a,probs=seq(0,1,.1)),include=TRUE)]
                 cut          V1  N
 1:    [-2.21,-1.05] -0.16938368 10
 2:   (-1.05,-0.614] -0.14103182 10
 3:  (-0.614,-0.375] -0.33474492 10
 4: (-0.375,-0.0767]  0.20827735 10
 5:  (-0.0767,0.114]  0.14890251 10
 6:    (0.114,0.377]  0.16685304 10
 7:    (0.377,0.581]  0.07086979 10
 8:    (0.581,0.771]  0.17950572 10
 9:     (0.771,1.18] -0.04951607 10
10:       (1.18,2.4] -0.45781268 10

这样你就可以看到最小值和最大值,而不是显示-Inf和+ Inf。注意,你需要传递 include = TRUE cut ,否则将返回11个bin,第一个只有1个。

That way you see what the min and max is, rather than it displaying -Inf and +Inf. Notice you need to pass include=TRUE to cut otherwise 11 bins will be returned with only 1 in the first.

这篇关于如何计算data.table中十进制组的统计信息的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆