How to emulate SQL "partition by" in R?

Problem description

How can I apply analytic functions like the Oracle ROW_NUMBER(), RANK(), or DENSE_RANK() functions (see http://www.orafaq.com/node/55) to an R data frame? The CRAN package "plyr" is very close but is still different.

I agree that the functionality of each function can potentially be achieved in an ad-hoc fashion. But my main concern is the performance. It would be good to avoid using join or indexing access, for the sake of memory and speed.

Solution

The data.table package, especially from version 1.8.1 onward, offers much of the functionality of SQL's PARTITION BY. In R, rank(x, ties.method = "min") behaves like Oracle's RANK(), and factors can be used to mimic the DENSE_RANK() function (described below). A way to mimic ROW_NUMBER() should be obvious by the end.

Here's an example: Load the latest version of data.table from R-Forge:

install.packages("data.table",
  repos= c("http://R-Forge.R-project.org", getOption("repos")))

library(data.table)

Create some example data:

set.seed(10)

DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),
  info=c(sample(c("a","b"),4*2,replace=TRUE),
  sample(c("c","d"),4,replace=TRUE)),key="ID")

> DT
    ID group       value info
 1:  1     1  0.01874617    a
 2:  2     1 -0.18425254    b
 3:  3     1 -1.37133055    b
 4:  4     2 -0.59916772    a
 5:  5     2  0.29454513    b
 6:  6     2  0.38979430    a
 7:  7     3 -1.20807618    b
 8:  8     3 -0.36367602    a
 9:  9     3 -1.62667268    c
10: 10     4 -0.25647839    d
11: 11     4  1.10177950    c
12: 12     4  0.75578151    d

Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):

> DT[,valRank:=rank(-value),by="group"]
    ID group       value info valRank
 1:  1     1  0.01874617    a       1
 2:  2     1 -0.18425254    b       2
 3:  3     1 -1.37133055    b       3
 4:  4     2 -0.59916772    a       3
 5:  5     2  0.29454513    b       2
 6:  6     2  0.38979430    a       1
 7:  7     3 -1.20807618    b       2
 8:  8     3 -0.36367602    a       1
 9:  9     3 -1.62667268    c       3
10: 10     4 -0.25647839    d       3
11: 11     4  1.10177950    c       1
12: 12     4  0.75578151    d       2
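
Since ROW_NUMBER() is not spelled out here, the following is a minimal sketch of one way to get it, assuming ties should be broken by order of appearance. The small table and the column name rowNum are made up for illustration; the only change from RANK() is the ties.method argument of base R's rank().

```r
library(data.table)

# Hand-made data with a tie in group 1 (illustrative values, not the data above)
DT <- data.table(ID    = 1:6,
                 group = rep(1:2, each = 3),
                 value = c(5, 5, 3, 2, 9, 4))

# RANK(): tied rows share the lowest rank, so a gap follows the tie
DT[, valRank := rank(-value, ties.method = "min"), by = group]

# ROW_NUMBER(): ties are broken by position, so every row gets a unique number
DT[, rowNum := rank(-value, ties.method = "first"), by = group]

print(DT)
```

In group 1 the two rows with value 5 both get valRank 1, while rowNum numbers them 1 and 2.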

For DENSE_RANK() with ties in the value being ranked, you could convert the value to a factor and then return the underlying integer values. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):

DT[,infoRank:=rank(info,ties.method="min"),by="group"]
DT[,infoRankDense:=as.integer(factor(info)),by="group"]

> DT
    ID group       value info valRank infoRank infoRankDense
 1:  1     1  0.01874617    a       1        1             1
 2:  2     1 -0.18425254    b       2        2             2
 3:  3     1 -1.37133055    b       3        2             2
 4:  4     2 -0.59916772    a       3        1             1
 5:  5     2  0.29454513    b       2        3             2
 6:  6     2  0.38979430    a       1        1             1
 7:  7     3 -1.20807618    b       2        2             2
 8:  8     3 -0.36367602    a       1        1             1
 9:  9     3 -1.62667268    c       3        3             3
10: 10     4 -0.25647839    d       3        2             2
11: 11     4  1.10177950    c       1        1             1
12: 12     4  0.75578151    d       2        2             2
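
As a hedged aside: later data.table releases than the one discussed here ship frank(), whose ties.method = "dense" option gives a dense rank directly. The sketch below assumes such a release is installed and uses a small made-up table to check that it agrees with the factor trick; infoRankDense2 is a name chosen only for this illustration.

```r
library(data.table)

# Illustrative data: the same info pattern as groups 1 and 2 above
DT <- data.table(group = rep(1:2, each = 3),
                 info  = c("a", "b", "b", "a", "b", "a"))

# Factor trick from the answer: integer codes of the sorted levels
DT[, infoRankDense := as.integer(factor(info)), by = group]

# Assumes a data.table release that provides frank()
DT[, infoRankDense2 := frank(info, ties.method = "dense"), by = group]

print(DT)
```

One difference to keep in mind: factor() sorts levels using the current locale, whereas data.table orders character values in the C locale, so the two can disagree on mixed-case or non-ASCII strings.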

p.s. Hi Matthew Dowle.


LEAD and LAG

For imitating LEAD and LAG, start with the answer provided here. I would create a rank variable based on the order of IDs within groups. This wouldn't be necessary with the fake data as above, but if the IDs are not in sequential order within groups, then this would make life a bit more difficult. So here's some new fake data with non-sequential IDs:

set.seed(10)

DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),
  info=c(sample(c("a","b"),4*2,replace=TRUE),
  sample(c("c","d"),4,replace=TRUE)),key="ID")

DT[,idRank:=rank(ID),by="group"]
setkey(DT,group, idRank)

> DT
    ID group       value info idRank
 1:  4     1 -0.36367602    b      1
 2:  5     1 -1.62667268    b      2
 3:  7     1 -1.20807618    b      3
 4:  1     2  1.10177950    a      1
 5:  2     2  0.75578151    a      2
 6: 12     2 -0.25647839    b      3
 7:  3     3  0.74139013    c      1
 8:  6     3  0.98744470    b      2
 9:  9     3 -0.23823356    a      3
10:  8     4 -0.19515038    c      1
11: 10     4  0.08934727    c      2
12: 11     4 -0.95494386    c      3

Then, to get the value from the previous record, join on the group and idRank variables, subtracting 1 from idRank, and use the mult = 'last' argument. To get the value from the record two entries above, subtract 2.

DT[,prev:=DT[J(group,idRank-1), value, mult='last']]
DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]

    ID group       value info idRank        prev      prev2
 1:  4     1 -0.36367602    b      1          NA         NA
 2:  5     1 -1.62667268    b      2 -0.36367602         NA
 3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760
 4:  1     2  1.10177950    a      1          NA         NA
 5:  2     2  0.75578151    a      2  1.10177950         NA
 6: 12     2 -0.25647839    b      3  0.75578151  1.1017795
 7:  3     3  0.74139013    c      1          NA         NA
 8:  6     3  0.98744470    b      2  0.74139013         NA
 9:  9     3 -0.23823356    a      3  0.98744470  0.7413901
10:  8     4 -0.19515038    c      1          NA         NA
11: 10     4  0.08934727    c      2 -0.19515038         NA
12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504

For LEAD, add the appropriate offset to the idRank variable and switch to mult = 'first':

DT[,nex:=DT[J(group,idRank+1), value, mult='first']]
DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]

    ID group       value info idRank        prev      prev2         nex       nex2
 1:  4     1 -0.36367602    b      1          NA         NA -1.62667268 -1.2080762
 2:  5     1 -1.62667268    b      2 -0.36367602         NA -1.20807618         NA
 3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760          NA         NA
 4:  1     2  1.10177950    a      1          NA         NA  0.75578151 -0.2564784
 5:  2     2  0.75578151    a      2  1.10177950         NA -0.25647839         NA
 6: 12     2 -0.25647839    b      3  0.75578151  1.1017795          NA         NA
 7:  3     3  0.74139013    c      1          NA         NA  0.98744470 -0.2382336
 8:  6     3  0.98744470    b      2  0.74139013         NA -0.23823356         NA
 9:  9     3 -0.23823356    a      3  0.98744470  0.7413901          NA         NA
10:  8     4 -0.19515038    c      1          NA         NA  0.08934727 -0.9549439
11: 10     4  0.08934727    c      2 -0.19515038         NA -0.95494386         NA
12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504          NA         NA
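
For comparison, here is a minimal sketch of LAG and LEAD without the self-join, assuming a more recent data.table release that provides shift() and setorder(). The data are made up so the result is easy to check by eye; the column names prev and nex simply mirror the ones above.

```r
library(data.table)

# Illustrative data with non-sequential IDs inside each group
DT <- data.table(ID    = c(5, 4, 7, 12, 1, 2),
                 group = rep(1:2, each = 3),
                 value = c(20, 10, 30, 60, 40, 50))

# Sort by ID within group so that "previous" and "next" are well defined
setorder(DT, group, ID)

# LAG(value, 1) and LEAD(value, 1) within each group; edges are filled with NA
DT[, prev := shift(value, n = 1, type = "lag"),  by = group]
DT[, nex  := shift(value, n = 1, type = "lead"), by = group]

print(DT)
```

Larger offsets work the same way by changing n, which plays the role of idRank-2 or idRank+2 in the join version.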
