How to emulate SQL "partition by" in R?
How can I compute analytic functions such as the Oracle ROW_NUMBER(), RANK(), or DENSE_RANK() functions (see http://www.orafaq.com/node/55) on an R data frame? The CRAN package "plyr" comes very close but is still different.
I agree that the functionality of each function can potentially be achieved in an ad-hoc fashion. But my main concern is the performance. It would be good to avoid using join or indexing access, for the sake of memory and speed.
The data.table package, especially starting with version 1.8.1, offers much of the functionality of partition in SQL terms. rank(x, ties.method = "min") in R is similar to Oracle RANK(), and there's a way using factors (described below) to mimic the DENSE_RANK() function. A way to mimic ROW_NUMBER should be obvious by the end.
Here's an example: Load the latest version of data.table from R-Forge:
install.packages("data.table",
repos= c("http://R-Forge.R-project.org", getOption("repos")))
library(data.table)
Create some example data:
set.seed(10)
DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),
info=c(sample(c("a","b"),4*2,replace=TRUE),
sample(c("c","d"),4,replace=TRUE)),key="ID")
> DT
ID group value info
1: 1 1 0.01874617 a
2: 2 1 -0.18425254 b
3: 3 1 -1.37133055 b
4: 4 2 -0.59916772 a
5: 5 2 0.29454513 b
6: 6 2 0.38979430 a
7: 7 3 -1.20807618 b
8: 8 3 -0.36367602 a
9: 9 3 -1.62667268 c
10: 10 4 -0.25647839 d
11: 11 4 1.10177950 c
12: 12 4 0.75578151 d
Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):
> DT[,valRank:=rank(-value),by="group"]
ID group value info valRank
1: 1 1 0.01874617 a 1
2: 2 1 -0.18425254 b 2
3: 3 1 -1.37133055 b 3
4: 4 2 -0.59916772 a 3
5: 5 2 0.29454513 b 2
6: 6 2 0.38979430 a 1
7: 7 3 -1.20807618 b 2
8: 8 3 -0.36367602 a 1
9: 9 3 -1.62667268 c 3
10: 10 4 -0.25647839 d 3
11: 11 4 1.10177950 c 1
12: 12 4 0.75578151 d 2
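The sample data above has no tied values, so the choice of ties.method does not show up in valRank. A small base R sketch on a made-up vector (x is illustrative and not part of the data above) shows how the default differs from the "min" method that behaves like Oracle RANK():

```r
# Hypothetical vector with one tie, for illustration only
x <- c(10, 20, 20, 30)

# Default ties.method = "average": tied values share the mean of their ranks
rank_avg <- rank(-x)

# ties.method = "min": tied values share the lowest rank and a gap follows,
# which is how Oracle RANK() numbers ties
rank_min <- rank(-x, ties.method = "min")

print(rank_avg)  # 4.0 2.5 2.5 1.0
print(rank_min)  # 4 2 2 1
```

The negation plays the same role as in the data.table call: it turns the ascending rank into a descending one.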
For DENSE_RANK() with ties in the value being ranked, you could convert the value to a factor and then return the underlying integer values. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):
DT[,infoRank:=rank(info,ties.method="min"),by="group"]
DT[,infoRankDense:=as.integer(factor(info)),by="group"]
R> DT
ID group value info valRank infoRank infoRankDense
1: 1 1 0.01874617 a 1 1 1
2: 2 1 -0.18425254 b 2 2 2
3: 3 1 -1.37133055 b 3 2 2
4: 4 2 -0.59916772 a 3 1 1
5: 5 2 0.29454513 b 2 3 2
6: 6 2 0.38979430 a 1 1 1
7: 7 3 -1.20807618 b 2 2 2
8: 8 3 -0.36367602 a 1 1 1
9: 9 3 -1.62667268 c 3 3 3
10: 10 4 -0.25647839 d 3 2 2
11: 11 4 1.10177950 c 1 1 1
12: 12 4 0.75578151 d 2 2 2
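For ROW_NUMBER(), one possible sketch is to rank with ties.method = "first", which breaks ties by position so that every row in a group gets a distinct number. Later data.table releases (1.9.6 and newer, so not the 1.8.1 version discussed here) also provide frank(), whose ties.method = "dense" returns the dense rank directly. The table dt and its column names below are made up for illustration:

```r
library(data.table)

# Hypothetical data: two groups, with a tie in 'score' inside group "x"
dt <- data.table(grp   = c("x", "x", "x", "y", "y"),
                 score = c(5, 9, 9, 3, 7))

# ROW_NUMBER() OVER (PARTITION BY grp ORDER BY score DESC):
# ties.method = "first" gives tied rows distinct, consecutive numbers
dt[, rowNum := rank(-score, ties.method = "first"), by = grp]

# DENSE_RANK() OVER (PARTITION BY grp ORDER BY score DESC) via frank(),
# assuming a data.table version that ships frank()
dt[, denseRank := frank(-score, ties.method = "dense"), by = grp]

print(dt)
```

The two tied rows in group "x" get rowNum 1 and 2 but share denseRank 1, which is the difference between the two SQL functions.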
p.s. Hi Matthew Dowle.
LEAD and LAG
For imitating LEAD and LAG, start with the answer provided here. I would create a rank variable based on the order of IDs within groups. This wouldn't be necessary with the fake data as above, but if the IDs are not in sequential order within groups, then this would make life a bit more difficult. So here's some new fake data with non-sequential IDs:
set.seed(10)
DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),
info=c(sample(c("a","b"),4*2,replace=TRUE),
sample(c("c","d"),4,replace=TRUE)),key="ID")
DT[,idRank:=rank(ID),by="group"]
setkey(DT,group, idRank)
> DT
ID group value info idRank
1: 4 1 -0.36367602 b 1
2: 5 1 -1.62667268 b 2
3: 7 1 -1.20807618 b 3
4: 1 2 1.10177950 a 1
5: 2 2 0.75578151 a 2
6: 12 2 -0.25647839 b 3
7: 3 3 0.74139013 c 1
8: 6 3 0.98744470 b 2
9: 9 3 -0.23823356 a 3
10: 8 4 -0.19515038 c 1
11: 10 4 0.08934727 c 2
12: 11 4 -0.95494386 c 3
Then, to get the value from the previous record, join on the group and idRank variables, subtracting 1 from idRank, and use the mult = 'last' argument. To get the value from the record two entries above, subtract 2.
DT[,prev:=DT[J(group,idRank-1), value, mult='last']]
DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]
ID group value info idRank prev prev2
1: 4 1 -0.36367602 b 1 NA NA
2: 5 1 -1.62667268 b 2 -0.36367602 NA
3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760
4: 1 2 1.10177950 a 1 NA NA
5: 2 2 0.75578151 a 2 1.10177950 NA
6: 12 2 -0.25647839 b 3 0.75578151 1.1017795
7: 3 3 0.74139013 c 1 NA NA
8: 6 3 0.98744470 b 2 0.74139013 NA
9: 9 3 -0.23823356 a 3 0.98744470 0.7413901
10: 8 4 -0.19515038 c 1 NA NA
11: 10 4 0.08934727 c 2 -0.19515038 NA
12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504
For LEAD, add the appropriate offset to the idRank variable and switch to mult = 'first':
DT[,nex:=DT[J(group,idRank+1), value, mult='first']]
DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]
ID group value info idRank prev prev2 nex nex2
1: 4 1 -0.36367602 b 1 NA NA -1.62667268 -1.2080762
2: 5 1 -1.62667268 b 2 -0.36367602 NA -1.20807618 NA
3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760 NA NA
4: 1 2 1.10177950 a 1 NA NA 0.75578151 -0.2564784
5: 2 2 0.75578151 a 2 1.10177950 NA -0.25647839 NA
6: 12 2 -0.25647839 b 3 0.75578151 1.1017795 NA NA
7: 3 3 0.74139013 c 1 NA NA 0.98744470 -0.2382336
8: 6 3 0.98744470 b 2 0.74139013 NA -0.23823356 NA
9: 9 3 -0.23823356 a 3 0.98744470 0.7413901 NA NA
10: 8 4 -0.19515038 c 1 NA NA 0.08934727 -0.9549439
11: 10 4 0.08934727 c 2 -0.19515038 NA -0.95494386 NA
12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504 NA NA
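The question asks to avoid joins where possible. As an alternative to the self-join above, data.table 1.9.6 and later (newer than the version used in this answer) includes shift(), which computes LAG and LEAD per group directly. A minimal sketch on made-up data, assuming the rows are first sorted by the hypothetical id column within each group:

```r
library(data.table)

# Hypothetical data with non-sequential ids inside each group
dt <- data.table(id    = c(4, 7, 5, 12, 1, 2),
                 grp   = c(1, 1, 1, 2, 2, 2),
                 value = c(10, 30, 20, 60, 40, 50))

# Sort by group, then id, so "previous" and "next" follow id order
setorder(dt, grp, id)

# LAG(value, 1) and LEAD(value, 1) OVER (PARTITION BY grp ORDER BY id)
dt[, prev := shift(value, n = 1, type = "lag"),  by = grp]
dt[, nex  := shift(value, n = 1, type = "lead"), by = grp]

print(dt)
```

Rows at the start or end of a group get NA, matching the NA entries produced by the join-based approach.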