(Efficiently) merge random keyed subset
Problem description
I have two data.tables; I'd like to assign an element of one to the other at random from among those that match keys. The way I'm doing so right now is quite slow.
Let's get specific; here's some sample data:
library(data.table)
dt1 <- data.table(id=sample(letters[1:5],500,replace=TRUE),var1=rnorm(500),key="id")
dt2 <- data.table(id=c(rep("a",4),rep("b",8),rep("c",2),rep("d",5),rep("e",7)),
                  place=paste(sample(c("Park","Pool","Rec Center","Library"),
                                     26,replace=TRUE),
                              sample(26)),key="id")
I want to add two randomly chosen places to dt1 for each observation, but the places have to match on id.
Here's what I'm doing now:
get_place<-function(xx) sapply(xx,function(x) dt2[.(x),sample(place,1)])
dt1[,paste0("place",1:2):=list(get_place(id),get_place(id))]
This works, but it's quite slow: it took 66 seconds to run on my computer, basically an eon.
One issue seems to be that I can't take proper advantage of keying. Something like dt2[.(dt1$id),mult="random"] would be perfect, but it doesn't appear to be possible.
Any suggestions?
Recommended answer
A simple answer
dt2[.(dt1),as.list(c(
place=sample(place,size=2,replace=TRUE)
)),by=.EACHI,allow.cartesian=TRUE]
This approach is simple and illustrates data.table features like Cartesian joins and by=.EACHI, but it is very slow because for each row of dt1 it (i) samples and (ii) coerces the result to a list.
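To make the per-row cost concrete, here is a minimal sketch (not from the original answer) that calls sample once per id group instead of once per row. The toy tables and the column name place1 are assumptions chosen for illustration; the logical subset dt2$id == .BY$id is used for readability and is not the keyed lookup the answers below rely on.

```r
library(data.table)
set.seed(1)

# Toy stand-ins for dt1 and dt2 (hypothetical values)
dt1 <- data.table(id = sample(c("a", "b"), 10, replace = TRUE),
                  var1 = rnorm(10), key = "id")
dt2 <- data.table(id = c("a", "a", "b"),
                  place = c("Park 1", "Pool 2", "Library 3"), key = "id")

# One sample() call per id: draw .N places at once for the whole group
dt1[, place1 := sample(dt2$place[dt2$id == .BY$id], .N, replace = TRUE), by = id]
```

Each group pays for a single sample call, so the number of calls scales with the number of distinct ids rather than with the number of rows in dt1.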
A faster answer
nsamp <- 2
dt3 <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
dt1[.(dt3),paste0("place",1:nsamp):=
replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]
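The index arithmetic in the code above can be sketched on a toy table. This is an illustration under assumptions: the data are made up, the names off and draws are hypothetical, and the offsets are computed with a plain by=id grouping rather than the by=.EACHI join used above, purely to keep the example short.

```r
library(data.table)
set.seed(1)

# Because dt2 is keyed by id, each id occupies one contiguous block of rows
dt2 <- data.table(id = c("a", "a", "a", "b", "b"),
                  place = c("Park 1", "Pool 2", "Library 3", "Pool 4", "Park 5"),
                  key = "id")

# i0 = number of rows before the id's block, N = size of the block
off <- dt2[, list(i0 = .I[1] - 1L, N = .N), by = id]

# i0 + sample(N, k) is then a valid row index into that id's block of dt2
draws <- off[, list(place = dt2$place[i0 + sample(N, 4, replace = TRUE)]), by = id]
```

Here id "a" gets i0 = 0 and N = 3, and id "b" gets i0 = 3 and N = 2, so every drawn index stays inside the rows belonging to its own id.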
Using replicate with simplify=FALSE (as also in @bgoldst's answer) makes the most sense:
- It returns a list of vectors, which is the format data.table requires when making new columns.
- replicate is the standard R function for repeated simulations.
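The difference simplify=FALSE makes can be seen in a small sketch; the table dt and the column names p1 and p2 are hypothetical and used only for illustration.

```r
library(data.table)
set.seed(42)

# Default simplify=TRUE collapses equal-length results into a matrix
m <- replicate(2, sample(letters, 3))

# simplify=FALSE keeps one vector per replication, i.e. a list of vectors
l <- replicate(2, sample(letters, 3), simplify = FALSE)

# A list of equal-length vectors maps directly onto new columns
dt <- data.table(x = 1:3)
dt[, c("p1", "p2") := l]
```

Each element of the list becomes one new column, which is why the list form fits the := assignment.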
Benchmarks. We should look at varying several features and not modify dt1 as we go along:
# candidate functions
frank2 <- function(){
dt3 <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
dt1[.(dt3),
replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]
}
david2 <- function(){
indx <- dt1[,.N, id]
sim <- dt2[.(indx),
replicate(2,sample(place,size=N,replace=TRUE),simplify=FALSE)
,by=.EACHI]
dt1[, sim[,-1,with=FALSE]]
}
bgoldst<-function(){
dt1[,
replicate(2,ave(id,id,FUN=function(x)
sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=F)
]
}
# simulation
size <- 1e6
nids <- 1e3
npls <- 2:15
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# benchmarking
library(microbenchmark)
res <- microbenchmark(frank2(),david2(),bgoldst(),times=10)
print(res,order="cld",unit="relative")
This gives:
Unit: relative
expr min lq mean median uq max neval cld
bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655 10 b
frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651 10 a
david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
And if we switch around the parameters...
# new simulation
size <- 1e4
nids <- 10
npls <- 1e6:2e6
dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]
# new benchmarking
res <- microbenchmark(frank2(),david2(),times=10)
print(res,order="cld",unit="relative")
we see:
Unit: relative
expr min lq mean median uq max neval cld
david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868 10 b
frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000 10 a
As one might expect, which way is faster (collapsing dt1 in david2 or collapsing dt2 in frank2) depends on how much information is compressed by collapsing.