(Efficiently) merge random keyed subset


Question

I have two data.tables, and I'd like to assign to each row of one a randomly chosen element of the other from among those with a matching key. My current approach is quite slow.

Let's get specific; here's some sample data:

library(data.table)

dt1<-data.table(id=sample(letters[1:5],500,replace=TRUE),var1=rnorm(500),key="id")
dt2<-data.table(id=c(rep("a",4),rep("b",8),rep("c",2),rep("d",5),rep("e",7)),
                place=paste(sample(c("Park","Pool","Rec Center","Library"),
                                   26,replace=TRUE),
                            sample(26)),key="id")

I want to add two randomly chosen places to dt1 for each observation, but the places have to match on id.

Here's what I'm doing now:

get_place<-function(xx) sapply(xx,function(x) dt2[.(x),sample(place,1)])

dt1[,paste0("place",1:2):=list(get_place(id),get_place(id))]

This works, but it's quite slow: it took 66 seconds to run on my computer, basically an eon.

One issue seems to be that I can't take proper advantage of keying:

Something like dt2[.(dt1$id),mult="random"] would be perfect, but it doesn't appear to be possible.
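For context, here is a minimal sketch of what mult does support. As documented in ?data.table, it accepts "all" (the default), "first", and "last", so a keyed join can return the first or last matching row for each lookup value but not a random one. The table `lookup` and its contents are hypothetical toy data, not the dt2 above.

```r
library(data.table)

# Hypothetical toy lookup table, keyed on id like dt2
lookup <- data.table(id    = c("a", "a", "b", "b", "b"),
                     place = c("Park 1", "Pool 2", "Library 3", "Park 4", "Pool 5"),
                     key   = "id")

# mult selects which matching row of lookup is returned for each value in i
first_match <- lookup[.(c("a", "b")), mult = "first"]
last_match  <- lookup[.(c("a", "b")), mult = "last"]

print(first_match)
print(last_match)
```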

Any suggestions?

Recommended answer


A simple answer

dt2[.(dt1),as.list(c(
  place=sample(place,size=2,replace=TRUE)
)),by=.EACHI,allow.cartesian=TRUE]

This approach is simple and illustrates data.table features like Cartesian joins and by=.EACHI, but it is very slow because, for each row of dt1, it (i) samples and (ii) coerces the result to a list.
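To make the per-row cost concrete, here is a small sketch of what by=.EACHI does: j is evaluated once for every row of i, so a sample() call in j runs once per lookup row. The tables `lookup` and `queries` are hypothetical toy data, and the join uses on= rather than the keyed `.()` form.

```r
library(data.table)

# Hypothetical toy data: 5 places across 2 ids, and 3 rows to look up
lookup  <- data.table(id    = c("a", "a", "b", "b", "b"),
                      place = c("Park 1", "Pool 2", "Library 3", "Park 4", "Pool 5"))
queries <- data.table(id = c("a", "b", "a"))

# by = .EACHI evaluates j separately for each row of queries
per_row <- lookup[queries, .N, by = .EACHI, on = "id"]

# One sample() call per row of queries, which is simple but costly at scale
picks <- lookup[queries, .(pick = sample(place, 1)), by = .EACHI, on = "id"]

print(per_row)
print(picks)
```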

A faster answer

nsamp <- 2
dt3   <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
dt1[.(dt3),paste0("place",1:nsamp):=
  replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
,by=.EACHI]
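The code above relies on dt2 being sorted by id, so each id occupies a contiguous block of rows. For each id, i0 is the row number just before the block and N is the block length, so i0 + sample(N, ...) draws row indices inside that block. A base R sketch of the same index arithmetic, using hypothetical toy vectors in place of dt2:

```r
# Hypothetical toy data: places stored contiguously by id, as in a keyed table
place <- c("Park 1", "Pool 2", "Library 3", "Park 4", "Pool 5")
ids   <- c("a", "a", "b", "b", "b")

# For id "b" the block is rows 3..5, so the offset is 2 and the size is 3
i0 <- which(ids == "b")[1] - 1L
N  <- sum(ids == "b")

# Draw 4 places for id "b" with a single vectorized sample() call
draws <- place[i0 + sample(N, 4, replace = TRUE)]
print(draws)
```

Because all draws for an id come from one sample() call, the number of calls scales with the number of ids rather than the number of rows in dt1.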

Using replicate with simplify=FALSE (as also in @bgoldst's answer) makes the most sense:


  • It returns a list of vectors, which is the format data.table requires when making new columns.
  • replicate is the standard R function for repeated simulations.
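A short sketch of the difference, using a hypothetical toy table `toy`: with the default simplify setting, replicate collapses equal-length results into a matrix, while simplify=FALSE keeps a list with one vector per replication, which can be assigned directly to several new columns.

```r
library(data.table)

# simplify = FALSE keeps one vector per replication in a list
as_list <- replicate(2, sample(letters, 3), simplify = FALSE)

# The default simplification collapses the results into a 3 x 2 matrix
as_matrix <- replicate(2, sample(letters, 3))

# A list of equal-length vectors maps directly onto new columns
toy <- data.table(x = 1:3)
toy[, paste0("draw", 1:2) := as_list]
print(toy)
```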

Benchmarks. We should look at varying several features and not modify dt1 as we go along:

# candidate functions (frank2 relies on nsamp <- 2 defined above)
frank2 <- function(){
  dt3   <- dt2[.(unique(dt1$id)),list(i0=.I[1]-1L,.N),by=.EACHI]
  dt1[.(dt3),
    replicate(nsamp,dt2$place[i0+sample(N,.N,replace=TRUE)],simplify=FALSE)
  ,by=.EACHI]
}
david2 <- function(){
  indx <- dt1[,.N, id]
  sim <- dt2[.(indx),
    replicate(2,sample(place,size=N,replace=TRUE),simplify=FALSE)
  ,by=.EACHI]
  dt1[, sim[,-1,with=FALSE]]
}
bgoldst<-function(){
  dt1[,
    replicate(2,ave(id,id,FUN=function(x) 
      sample(dt2$place[dt2$id==x[1]],length(x),replace=TRUE)),simplify=FALSE)
  ]
}

# simulation
size <- 1e6
nids <- 1e3
npls <- 2:15

dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
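# by="id" is explicit because recent data.table versions no longer default unique() to the key
dt2 <- unique(dt1,by="id")[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]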

# benchmarking
library(microbenchmark)
res <- microbenchmark(frank2(),david2(),bgoldst(),times=10)
print(res,order="cld",unit="relative")

This gives

Unit: relative
      expr      min       lq     mean   median       uq      max neval cld
 bgoldst() 8.246783 8.280276 7.090995 7.142832 6.579406 5.692655    10   b
  frank2() 1.042862 1.107311 1.074722 1.152977 1.092632 0.931651    10  a 
  david2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

And if we switch around the parameters...

# new simulation
size <- 1e4
nids <- 10
npls <- 1e6:2e6

dt1 <- data.table(id=sample(1:nids,size=size,replace=TRUE),var1=rnorm(size),key="id")
dt2 <- unique(dt1,by="id")[,list(place=sample(letters,sample(npls,1),replace=TRUE)),by=id]

# new benchmarking
res <- microbenchmark(frank2(),david2(),times=10)
print(res,order="cld",unit="relative")

we see

Unit: relative
     expr    min     lq     mean   median       uq     max neval cld
 david2() 3.3008 3.2842 3.274905 3.286772 3.280362 3.10868    10   b
 frank2() 1.0000 1.0000 1.000000 1.000000 1.000000 1.00000    10  a 

As one might expect, which way is faster (collapsing dt1 in david2 or collapsing dt2 in frank2) depends on how much information is compressed by collapsing.
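One rough way to gauge this, offered as my reading of the benchmarks rather than a rule, is to compare the average number of rows per id in each table: collapsing a table to one row per id shrinks it by roughly that factor. The tables `toy1` and `toy2` below are hypothetical stand-ins for dt1 and dt2.

```r
library(data.table)

set.seed(1)
# Hypothetical stand-ins: many observations per id, few places per id
toy1 <- data.table(id = sample(1:10, 1000, replace = TRUE))
toy2 <- data.table(id    = rep(1:10, each = 3),
                   place = sample(letters, 30, replace = TRUE))

# Average rows per id, i.e. how much each table shrinks when collapsed by id
rows_per_id_dt1 <- nrow(toy1) / uniqueN(toy1$id)
rows_per_id_dt2 <- nrow(toy2) / uniqueN(toy2$id)

print(c(dt1 = rows_per_id_dt1, dt2 = rows_per_id_dt2))
```

In this toy setup the first table shrinks far more when collapsed, which resembles the first benchmark; the second benchmark reverses the ratio by giving each id over a million places.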
