如何在R中模拟SQL排名函数? [英] How to emulate SQLs rank functions in R?

查看:101
本文介绍了如何在R中模拟SQL排名函数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

类似于Oracle ROW_NUMBER()RANK()DENSE_RANK()的秩函数的R等效项是什么(根据行的顺序将整数值分配给行";请参见

What is the R equivalent of rank functions like the Oracle ROW_NUMBER(), RANK(), or DENSE_RANK() ("assign integer values to the rows depending on their order"; see http://www.orafaq.com/node/55)?

我同意可以以临时方式实现每个功能的功能.但是我主要关心的是性能.出于内存和速度的考虑,最好避免使用联接或索引访问.

I agree that the functionality of each function can potentially be achieved in an ad-hoc fashion. But my main concern is the performance. It would be good to avoid using join or indexing access, for the sake of memory and speed.

推荐答案

data.table软件包,尤其是从1.8.1版开始,以SQL术语提供了分区的许多功能. R中的rank(x, ties.method = "min")与Oracle RANK()相似,并且有一种使用因素(如下所述)模仿DENSE_RANK()功能的方法.最后,模仿ROW_NUMBER的方法应该很明显.

The data.table package, especially starting with version 1.8.1, offers much of the functionality of partition in SQL terms. rank(x, ties.method = "min") in R is similar to Oracle RANK(), and there's a way using factors (described below) to mimic the DENSE_RANK() function. A way to mimic ROW_NUMBER should be obvious by the end.

这里是一个示例:从R-Forge加载最新版本的data.table:

Here's an example: Load the latest version of data.table from R-Forge:

install.packages("data.table",
  repos= c("http://R-Forge.R-project.org", getOption("repos")))

library(data.table)

创建一些示例数据:

set.seed(10)

DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),
  info=c(sample(c("a","b"),4*2,replace=TRUE),
  sample(c("c","d"),4,replace=TRUE)),key="ID")

> DT
    ID group       value info
 1:  1     1  0.01874617    a
 2:  2     1 -0.18425254    b
 3:  3     1 -1.37133055    b
 4:  4     2 -0.59916772    a
 5:  5     2  0.29454513    b
 6:  6     2  0.38979430    a
 7:  7     3 -1.20807618    b
 8:  8     3 -0.36367602    a
 9:  9     3 -1.62667268    c
10: 10     4 -0.25647839    d
11: 11     4  1.10177950    c
12: 12     4  0.75578151    d

通过减小group中的value来排名每个ID(请注意value前面的-以表示降序):

Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):

> DT[,valRank:=rank(-value),by="group"]
    ID group       value info valRank
 1:  1     1  0.01874617    a       1
 2:  2     1 -0.18425254    b       2
 3:  3     1 -1.37133055    b       3
 4:  4     2 -0.59916772    a       3
 5:  5     2  0.29454513    b       2
 6:  6     2  0.38979430    a       1
 7:  7     3 -1.20807618    b       2
 8:  8     3 -0.36367602    a       1
 9:  9     3 -1.62667268    c       3
10: 10     4 -0.25647839    d       3
11: 11     4  1.10177950    c       1
12: 12     4  0.75578151    d       2

对于在值中具有平局关系的DENSE_RANK(),您可以将值转换为因子,然后返回基础整数值.例如,根据group中的info对每个ID进行排名(将infoRankinfoRankDense进行比较):

For DENSE_RANK() with ties in the value being ranked, you could convert the value to a factor and then return the underlying integer values. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):

DT[,infoRank:=rank(info,ties.method="min"),by="group"]
DT[,infoRankDense:=as.integer(factor(info)),by="group"]

R> DT
    ID group       value info valRank infoRank infoRankDense
 1:  1     1  0.01874617    a       1        1             1
 2:  2     1 -0.18425254    b       2        2             2
 3:  3     1 -1.37133055    b       3        2             2
 4:  4     2 -0.59916772    a       3        1             1
 5:  5     2  0.29454513    b       2        3             2
 6:  6     2  0.38979430    a       1        1             1
 7:  7     3 -1.20807618    b       2        2             2
 8:  8     3 -0.36367602    a       1        1             1
 9:  9     3 -1.62667268    c       3        3             3
10: 10     4 -0.25647839    d       3        2             2
11: 11     4  1.10177950    c       1        1             1
12: 12     4  0.75578151    d       2        2             2

p.s.嗨Matthew Dowle.

p.s. Hi Matthew Dowle.

领导和落后

要模仿LEAD和LAG,请从此处提供的答案开始.我将根据组内ID的顺序创建一个等级变量.对于上面的伪数据来说,这不是必需的,但是,如果ID在组中的顺序不是按顺序的,那么这将使生活变得更加困难.因此,这是一些具有非顺序ID的新伪造数据:

For imitating LEAD and LAG, start with the answer provided here. I would create a rank variable based on the order of IDs within groups. This wouldn't be necessary with the fake data as above, but if the IDs are not in sequential order within groups, then this would make life a bit more difficult. So here's some new fake data with non-sequential IDs:

set.seed(10)

DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),
  info=c(sample(c("a","b"),4*2,replace=TRUE),
  sample(c("c","d"),4,replace=TRUE)),key="ID")

DT[,idRank:=rank(ID),by="group"]
setkey(DT,group, idRank)

> DT
    ID group       value info idRank
 1:  4     1 -0.36367602    b      1
 2:  5     1 -1.62667268    b      2
 3:  7     1 -1.20807618    b      3
 4:  1     2  1.10177950    a      1
 5:  2     2  0.75578151    a      2
 6: 12     2 -0.25647839    b      3
 7:  3     3  0.74139013    c      1
 8:  6     3  0.98744470    b      2
 9:  9     3 -0.23823356    a      3
10:  8     4 -0.19515038    c      1
11: 10     4  0.08934727    c      2
12: 11     4 -0.95494386    c      3

然后要获取上一条记录的值,请使用groupidRank变量,并从idRank中减去1并使用multi = 'last'参数.要从上面的记录中的两个条目中获取值,请减去2.

Then to get the values of the previous 1 record, use the group and idRank variables and subtract 1 from the idRank and use the multi = 'last' argument. To get the value from the record two entries above, subtract 2.

DT[,prev:=DT[J(group,idRank-1), value, mult='last']]
DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]

    ID group       value info idRank        prev      prev2
 1:  4     1 -0.36367602    b      1          NA         NA
 2:  5     1 -1.62667268    b      2 -0.36367602         NA
 3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760
 4:  1     2  1.10177950    a      1          NA         NA
 5:  2     2  0.75578151    a      2  1.10177950         NA
 6: 12     2 -0.25647839    b      3  0.75578151  1.1017795
 7:  3     3  0.74139013    c      1          NA         NA
 8:  6     3  0.98744470    b      2  0.74139013         NA
 9:  9     3 -0.23823356    a      3  0.98744470  0.7413901
10:  8     4 -0.19515038    c      1          NA         NA
11: 10     4  0.08934727    c      2 -0.19515038         NA
12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504

对于LEAD,将适当的偏移量添加到idRank变量并切换到multi = 'first':

For LEAD, add the appropriate offset to the idRank variable and switch to multi = 'first':

DT[,nex:=DT[J(group,idRank+1), value, mult='first']]
DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]

    ID group       value info idRank        prev      prev2         nex       nex2
 1:  4     1 -0.36367602    b      1          NA         NA -1.62667268 -1.2080762
 2:  5     1 -1.62667268    b      2 -0.36367602         NA -1.20807618         NA
 3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760          NA         NA
 4:  1     2  1.10177950    a      1          NA         NA  0.75578151 -0.2564784
 5:  2     2  0.75578151    a      2  1.10177950         NA -0.25647839         NA
 6: 12     2 -0.25647839    b      3  0.75578151  1.1017795          NA         NA
 7:  3     3  0.74139013    c      1          NA         NA  0.98744470 -0.2382336
 8:  6     3  0.98744470    b      2  0.74139013         NA -0.23823356         NA
 9:  9     3 -0.23823356    a      3  0.98744470  0.7413901          NA         NA
10:  8     4 -0.19515038    c      1          NA         NA  0.08934727 -0.9549439
11: 10     4  0.08934727    c      2 -0.19515038         NA -0.95494386         NA
12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504          NA         NA

这篇关于如何在R中模拟SQL排名函数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆