How to emulate SQL's rank functions in R?
Question
What is the R equivalent of rank functions like the Oracle ROW_NUMBER(), RANK(), or DENSE_RANK() ("assign integer values to the rows depending on their order"; see http://www.orafaq.com/node/55)?
I agree that the functionality of each function can potentially be achieved in an ad-hoc fashion. But my main concern is performance. It would be good to avoid joins or indexed access, for the sake of memory and speed.
Recommended answer
The data.table package, especially starting with version 1.8.1, offers much of the functionality of partitioning in SQL terms. rank(x, ties.method = "min") in R is similar to Oracle RANK(), and there's a way using factors (described below) to mimic the DENSE_RANK() function. A way to mimic ROW_NUMBER() should be obvious by the end.
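As a quick orientation before the data.table examples, the three SQL functions differ only in how they treat ties. The following is a minimal base R sketch on a made-up vector; the variable names are illustrative, and mapping ROW_NUMBER() to ties.method = "first" is an assumption about what the answer leaves as "obvious".

```r
# Toy vector with a tie (values are made up for illustration).
x <- c(10, 20, 20, 30)

# RANK(): tied values share the lowest rank, and the next rank is skipped.
sql_rank <- rank(x, ties.method = "min")

# DENSE_RANK(): tied values share a rank and no ranks are skipped.
sql_dense_rank <- as.integer(factor(x))

# ROW_NUMBER(): ties are broken by position, so every row gets a distinct number.
sql_row_number <- rank(x, ties.method = "first")

print(rbind(x, sql_rank, sql_dense_rank, sql_row_number))
```

Note that rank() with its default ties.method = "average" can return fractional ranks for ties, which has no direct counterpart among these three SQL functions.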
Here's an example: Load the latest version of data.table from R-Forge:
install.packages("data.table",
repos= c("http://R-Forge.R-project.org", getOption("repos")))
library(data.table)
Create some example data:
set.seed(10)
DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),
info=c(sample(c("a","b"),4*2,replace=TRUE),
sample(c("c","d"),4,replace=TRUE)),key="ID")
> DT
ID group value info
1: 1 1 0.01874617 a
2: 2 1 -0.18425254 b
3: 3 1 -1.37133055 b
4: 4 2 -0.59916772 a
5: 5 2 0.29454513 b
6: 6 2 0.38979430 a
7: 7 3 -1.20807618 b
8: 8 3 -0.36367602 a
9: 9 3 -1.62667268 c
10: 10 4 -0.25647839 d
11: 11 4 1.10177950 c
12: 12 4 0.75578151 d
Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):
> DT[,valRank:=rank(-value),by="group"]
ID group value info valRank
1: 1 1 0.01874617 a 1
2: 2 1 -0.18425254 b 2
3: 3 1 -1.37133055 b 3
4: 4 2 -0.59916772 a 3
5: 5 2 0.29454513 b 2
6: 6 2 0.38979430 a 1
7: 7 3 -1.20807618 b 2
8: 8 3 -0.36367602 a 1
9: 9 3 -1.62667268 c 3
10: 10 4 -0.25647839 d 3
11: 11 4 1.10177950 c 1
12: 12 4 0.75578151 d 2
For DENSE_RANK() with ties in the value being ranked, you could convert the value to a factor and then return the underlying integer values. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):
DT[,infoRank:=rank(info,ties.method="min"),by="group"]
DT[,infoRankDense:=as.integer(factor(info)),by="group"]
R> DT
ID group value info valRank infoRank infoRankDense
1: 1 1 0.01874617 a 1 1 1
2: 2 1 -0.18425254 b 2 2 2
3: 3 1 -1.37133055 b 3 2 2
4: 4 2 -0.59916772 a 3 1 1
5: 5 2 0.29454513 b 2 3 2
6: 6 2 0.38979430 a 1 1 1
7: 7 3 -1.20807618 b 2 2 2
8: 8 3 -0.36367602 a 1 1 1
9: 9 3 -1.62667268 c 3 3 3
10: 10 4 -0.25647839 d 3 2 2
11: 11 4 1.10177950 c 1 1 1
12: 12 4 0.75578151 d 2 2 2
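ROW_NUMBER() is not shown explicitly above. One possible sketch, assuming a data.table version that provides frank() and using hypothetical toy data rather than the DT above, is to rank with ties.method = "first" within each group. The same block shows frank(..., ties.method = "dense") as an alternative to the factor trick for DENSE_RANK().

```r
library(data.table)

# Hypothetical toy data, not the DT from the answer.
toy <- data.table(group = c(1, 1, 1, 2, 2, 2),
                  info  = c("a", "b", "b", "a", "b", "a"))

# ROW_NUMBER() OVER (PARTITION BY group ORDER BY info):
# ties.method = "first" breaks ties by position within the group.
toy[, rowNum := rank(info, ties.method = "first"), by = group]

# DENSE_RANK() via frank(), as an alternative to as.integer(factor(info)).
toy[, denseRank := frank(info, ties.method = "dense"), by = group]

print(toy)
```

If the row number should simply follow the current row order within each group, seq_len(.N) with by = group would be a shorter option.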
P.S. Hi, Matthew Dowle.
LEAD and LAG
For imitating LEAD and LAG, start with the answer provided here. I would create a rank variable based on the order of IDs within groups. This wouldn't be necessary with the fake data above, but if the IDs are not in sequential order within groups, then this would make life a bit more difficult. So here's some new fake data with non-sequential IDs:
set.seed(10)
DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),
info=c(sample(c("a","b"),4*2,replace=TRUE),
sample(c("c","d"),4,replace=TRUE)),key="ID")
DT[,idRank:=rank(ID),by="group"]
setkey(DT,group, idRank)
> DT
ID group value info idRank
1: 4 1 -0.36367602 b 1
2: 5 1 -1.62667268 b 2
3: 7 1 -1.20807618 b 3
4: 1 2 1.10177950 a 1
5: 2 2 0.75578151 a 2
6: 12 2 -0.25647839 b 3
7: 3 3 0.74139013 c 1
8: 6 3 0.98744470 b 2
9: 9 3 -0.23823356 a 3
10: 8 4 -0.19515038 c 1
11: 10 4 0.08934727 c 2
12: 11 4 -0.95494386 c 3
Then, to get the value of the previous record, join on the group and idRank variables, subtracting 1 from idRank, and use the mult = 'last' argument. To get the value from the record two entries above, subtract 2.
DT[,prev:=DT[J(group,idRank-1), value, mult='last']]
DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]
ID group value info idRank prev prev2
1: 4 1 -0.36367602 b 1 NA NA
2: 5 1 -1.62667268 b 2 -0.36367602 NA
3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760
4: 1 2 1.10177950 a 1 NA NA
5: 2 2 0.75578151 a 2 1.10177950 NA
6: 12 2 -0.25647839 b 3 0.75578151 1.1017795
7: 3 3 0.74139013 c 1 NA NA
8: 6 3 0.98744470 b 2 0.74139013 NA
9: 9 3 -0.23823356 a 3 0.98744470 0.7413901
10: 8 4 -0.19515038 c 1 NA NA
11: 10 4 0.08934727 c 2 -0.19515038 NA
12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504
For LEAD, add the appropriate offset to the idRank variable and switch to mult = 'first':
DT[,nex:=DT[J(group,idRank+1), value, mult='first']]
DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]
ID group value info idRank prev prev2 nex nex2
1: 4 1 -0.36367602 b 1 NA NA -1.62667268 -1.2080762
2: 5 1 -1.62667268 b 2 -0.36367602 NA -1.20807618 NA
3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760 NA NA
4: 1 2 1.10177950 a 1 NA NA 0.75578151 -0.2564784
5: 2 2 0.75578151 a 2 1.10177950 NA -0.25647839 NA
6: 12 2 -0.25647839 b 3 0.75578151 1.1017795 NA NA
7: 3 3 0.74139013 c 1 NA NA 0.98744470 -0.2382336
8: 6 3 0.98744470 b 2 0.74139013 NA -0.23823356 NA
9: 9 3 -0.23823356 a 3 0.98744470 0.7413901 NA NA
10: 8 4 -0.19515038 c 1 NA NA 0.08934727 -0.9549439
11: 10 4 0.08934727 c 2 -0.19515038 NA -0.95494386 NA
12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504 NA NA
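The self-joins above do the lookup through the key. As a hedged alternative, later data.table releases include a shift() function that produces lagged and leading values directly. The sketch below assumes such a version is installed and uses hypothetical toy data rather than the DT above.

```r
library(data.table)

# Hypothetical toy data with non-sequential IDs (not the DT from the answer).
toy <- data.table(ID    = c(7, 4, 5, 12, 1, 2),
                  group = c(1, 1, 1, 2, 2, 2),
                  value = c(0.3, 0.1, 0.2, 0.6, 0.4, 0.5))

# Sort by group, then ID, so "previous" and "next" follow ID order within group.
setorder(toy, group, ID)

# LAG(value, 1) and LEAD(value, 1) within each group; missing neighbours become NA.
toy[, prev := shift(value, n = 1L, type = "lag"),  by = group]
toy[, nex  := shift(value, n = 1L, type = "lead"), by = group]

print(toy)
```

Because shift() works on the rows in their current order, the explicit setorder() call plays the role that idRank and setkey() play in the join-based version.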