(Efficiently) merge random keyed subset

Views: 138. Posted: 2017/3/12 11:52:02. Tags: r, data.table


Problem description

I have two data.tables; I'd like to assign an element of one to the other at random from among those that match keys. The way I'm doing so right now is quite slow.

Let's get specific; here's some sample data:

library(data.table)

dt1<-data.table(id=sample(letters[1:5],500,replace=TRUE),var1=rnorm(500),key="id")
dt2<-data.table(id=c(rep("a",4),rep("b",8),rep("c",2),rep("d",5),rep("e",7)),
                place=paste(sample(c("Park","Pool","Rec Center","Library"),
                                   26,replace=TRUE),
                            sample(26)),key="id")

I want to add two randomly chosen places to dt1 for each observation, but the places have to match on id.

Here's what I'm doing now:

get_place<-function(xx) sapply(xx,function(x) dt2[.(x),sample(place,1)])

dt1[,paste0("place",1:2):=list(get_place(id),get_place(id))]

This works, but it's quite slow: it took 66 seconds to run on my computer, basically an eon.

One issue seems to be that I can't take proper advantage of keying. Something like dt2[.(dt1$id),mult="random"] would be perfect, but it doesn't appear to be possible.

Any suggestions?

Recommended answer

A simple answer
dt2[.(dt1),as.list(c(
  place=sample(place,size=2,replace=TRUE)
)),by=.EACHI,allow.cartesian=TRUE]

This approach is simple and illustrates data.table features like Cartesian joins and by=.EACHI, but it is very slow because, for each row of dt1, it (i) samples and (ii) coerces the result to a list.

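To make the cost difference concrete, here is a minimal base-R sketch (not the answer's code) contrasting one sample() call per row with a single vectorised call for a whole group. The pool vector and the size n are made-up values for illustration.

```r
# Hypothetical pool of places for a single id; values are illustrative only.
pool <- c("Park 1", "Pool 2", "Library 3")
n <- 1000L  # assumed number of rows sharing that id

set.seed(42)  # seed chosen arbitrarily, for reproducibility

# Per-row approach: n separate calls to sample(), one draw each.
per_row <- vapply(seq_len(n), function(i) sample(pool, 1), character(1))

# Per-group approach: one call to sample() producing all n draws.
per_group <- sample(pool, n, replace = TRUE)

# Both are character vectors of length n drawn from the same pool.
length(per_row) == length(per_group)
```

Wrapping each expression in system.time() should show the gap for larger n; exact timings depend on the machine.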
A faster answer

nsamp <- 2
dt3   <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
dt1[.(dt3),paste0("place",1:nsamp):=
  replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]
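
The code above works because dt2 is keyed by id, so the rows for each id sit in one contiguous block: i0 is the row index just before the block and N is its length. A minimal sketch of that index arithmetic on a toy table (the data and the choice of id "b" are assumptions for illustration):

```r
library(data.table)
set.seed(1)  # arbitrary seed, for reproducibility only

# Toy lookup table (hypothetical values); keying sorts it by id.
toy2 <- data.table(id    = c("a", "a", "a", "b", "b"),
                   place = c("Park 1", "Pool 2", "Library 3", "Park 4", "Pool 5"),
                   key   = "id")

# One row per id: offset of the block (first row index minus 1) and block size.
toy3 <- toy2[, list(i0 = .I[1] - 1L, N = .N), by = id]

# Draw 4 places for id "b": offset plus a random position inside the block.
g    <- toy3[id == "b"]
rows <- g$i0 + sample(g$N, 4, replace = TRUE)
toy2$place[rows]
```

Every value in rows falls inside the block for "b" (rows 4 and 5 here), so only places with a matching id can be drawn.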

Using replicate with simplify=FALSE (as also in @bgoldst's answer) makes the most sense:

It returns a list of vectors, which is the format data.table requires when making new columns.
replicate is the standard R function for repeated simulations.
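
A small base-R sketch of the difference simplify makes; the places vector is a made-up example, not data from the question:

```r
set.seed(1)  # arbitrary seed, for reproducibility only
places <- c("Park", "Pool", "Library")  # hypothetical values

# simplify = FALSE: a list with one character vector per repetition,
# which is the shape := expects when creating several columns at once.
draws <- replicate(2, sample(places, 5, replace = TRUE), simplify = FALSE)
str(draws)

# Default (simplify = TRUE): the same call is collapsed into a 5 x 2 matrix.
m <- replicate(2, sample(places, 5, replace = TRUE))
dim(m)
```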

Benchmarks. We should look at varying several features and not modify dt1 as we go along:

# candidate functions
frank2 <- function(){
  dt3   <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
  dt1[.(dt3),
    replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
  ,by=.EACHI]
}
david2 <- function(){
  indx <- dt1[,.N, id]
  sim <- dt2[.(indx),
    replicate(2,sample(place,size=N,replace=TRUE),simplify=FALSE)
  ,by=.EACHI]
  dt1[, sim[,-1,with=FALSE]]
}
bgoldst<-function(){
  dt1[,
    replicate(2,ave(id,id,FUN=function(x) 
      sample(dt2$place[dt2$id==x[1]],length(x),replace=T)),simplify=F)
  ]
}

# simulation
size <- 1e6
nids <- 1e3
npls <- 2:15

dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]

# benchmarking
library(microbenchmark)
res <- microbenchmark(frank2(),david2(),bgoldst(),times=10)
print(res,order="cld",unit="relative")

This gives:
Unit: relative
      expr      min       lq     mean   median       uq      max neval cld
 bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655    10   b
  frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651    10  a 
  david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

And if we switch around the parameters...

# new simulation
size <- 1e4
nids <- 10
npls <- 1e6:2e6

dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1)[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]

# new benchmarking
res <- microbenchmark(frank2(),david2(),times=10)
print(res,order="cld",unit="relative")

we see:
Unit: relative
     expr    min     lq     mean   median       uq     max neval cld
 david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868    10   b
 frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000    10  a 

As one might expect, which way is faster (collapsing dt1 in david2 or collapsing dt2 in frank2) depends on how much information is compressed by collapsing.
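
One rough way to gauge this before benchmarking is to compare how many rows each table keeps after collapsing to one row per id. This is only a sketch on toy tables with assumed sizes, not a diagnostic used in the answer, and row counts are a crude proxy for the actual work done:

```r
library(data.table)
set.seed(1)  # arbitrary seed, for reproducibility only

# Toy tables (hypothetical sizes): many observations, few places per id.
toy1 <- data.table(id = sample(1:10, 1000, replace = TRUE), key = "id")
toy2 <- data.table(id = rep(1:10, each = 3),
                   place = sample(letters, 30, replace = TRUE), key = "id")

# Rows left after collapsing to one row per id, as a share of the original.
ratio1 <- nrow(toy1[, .N, by = id]) / nrow(toy1)  # the table david2 collapses
ratio2 <- nrow(toy2[, .N, by = id]) / nrow(toy2)  # the table frank2 collapses
c(dt1 = ratio1, dt2 = ratio2)
```

A smaller ratio means collapsing that table removes more rows; in this toy setup that is the dt1 side.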
