r - 对数据表的每一行应用函数 [英] r - apply function to each row of a data.table

查看:203
本文介绍了r - 对数据表的每一行应用函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用 data.table 来提高给定函数的速度,但我不确定我是否以正确的方式实现它:



数据



给定两个 data.table s( dt dt_lookup

  library(data.table)
set.seed(1234)
t< - seq(1,100); l - 字母; la < - letters [1:13]; lb < - letters [14:26]
n < - 10000
dt < - data.table(id = seq(1:n),
thisTime = sample ,replace = TRUE),
thisLocation = sample(la,n,replace = TRUE),
finalLocation = sample(lb,n,replace = TRUE)

set.seed(4321)
dt_lookup < - data.table(lkpId = paste0(l - ,seq(1,1000)),
lkpTime = sample (t,10000,replace = TRUE),
lkpLocation = sample(l,10000,replace = TRUE))
##注意:lkpId是循环使用
setkey(dt_lookup,lkpLocation)

我有一个函数可以找到 lkpId 包含 thisLocation finalLocation ,并且具有'nearest' lkpTime (即 thisTime - lkpTime 的最小非负值)



>

  ##函数获取'next'lkpId带有thisLocation和finalLocation的lkpId,
##和thisTime和dt_lookup之间的最小非负时间$ lkpTime)
getId < - function(thisTime,thisLocation,finalLocation){

##基于thisLocation和finalLocation,
##的过滤器查找,并且只返回lkpId具有'this'和'final'位置的值
tempThis< - unique(dt_lookup [lkpLocation = = thisLocation,lkpId])
tempFinal < - unique(dt_lookup [lkpLocation == finalLocation,lkpId])
availServices < - tempThis [tempThis%in%tempFinal]

tempThisFinal< - dt_lookup [lkpId%in%availServices& lkpLocation == thisLocation,。(lkpId,lkpTime)]

## calcualte'thisTime'和'lkpTime'之间的时间差(来自thisLocation)
temp2 < - thisTime - tempThisFinal $ lkpTime

##取具有最小非负差分的lkpId
selectedId< - tempThisFinal [min(which(temp2 == min(temp2 [temp2> 0]))),lkpId ]
selectedId
}

尝试解决方案



我需要为 dt lkpId $ c>。因此,我的初始本能是使用 * apply 函数,但是对于 n / nrow> 1,000,000 。所以我试图实现一个 data.table 解决方案,看看它是否更快:

  selectedId < -  dt [,。(lkpId = getId(thisTime,thisLocation,finalLocation)),by = id] 

但是,我是 data.table 的新用户,并且这个方法看起来不会比 * apply 解决方案:

  lkpIds<  -  apply (x){
thisLocation< - as.character(x [[ThisLocation]])
finalLocation < - as.character(x [[finalLocation]])
thisTime < - as.numeric(x [[thisTime]])
myId < - getId(thisTime,thisLocation,finalLocation)
})

两者花费约30秒(n = 10,000)。



问题



有更好的方法使用 data.table dt 的每一行上应用 getId p>

更新12/08/2015



我重新设计了我的整个算法,并利用滚动连接(一个好的介绍) ),从而正确使用 data.table

解决方案

自从提出这个问题以来,我们一直在研究 data.table 必须提供,研究 data.table 连接感谢@ eddi的指针(例如在数据表上滚动连接,以及 inner join with inequality ),我想出了一个解决方案。



一个棘手的部分是从'apply每行的函数,并重新设计解决方案以使用连接。



并且,毫无疑问,这将是更好的编程方式,但这里是我的尝试。

  ##想要为每个id找到一个lkpId,它在'thisTime'和'lkpTime'之间的最小差异
##,其中lkpId包含'thisLocation'和'finalLocation'

##找到所有查找id的'thisLocation'匹配'lookupLocation'
##和whereTime - lkpTime> 0
setkey(dt,thisLocation)
setkey(dt_lookup,lkpLocation)

dt_this < - dt [dt_lookup,{
idx = thisTime - i.lkpTime> ; 0
。(id = id [idx],
lkpId = i.lkpId,
thisTime = thisTime [idx],
lkpTime = i.lkpTime)
} ,
by = .EACHI]

## remove NAs
dt_this < - dt_this [complete.cases(dt_this)]

## find所有匹配的'finalLocation'和'lookupLocaiton'
setkey(dt,finalLocation)
##内部连接(并且只返回id列)
dt_final < - dt [dt_lookup,nomatch = 0 ,allow.cartesian = TRUE] [,。(id,lkpId)]

##将dt_this连接到dt_final(因为lkpId必须同时具有'thisLocation'和'finalLocation')
setkey dt_this,id,lkpId)
setkey(dt_final,id,lkpId)

dt_join < - dt_this [dt_final,nomatch = 0]

##组合'thisTime'和'lkpTime'之间的最小差异
dt_join [,timeDiff:= thisTime-lkpTime]

dt_join
##等效的dplyr代码
#library(dplyr)
#dt_this< - dt_this%>%
#group_by(id)%>%
#arrange(timeDiff)%>%
#slice(1)%>%
#ungroup


I'm looking to use data.table to improve speed for a given function, but I'm not sure I'm implementing it the correct way:

Data

Given two data.tables (dt and dt_lookup)

library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1:n), 
                 thisTime=sample(t, n, replace=TRUE), 
                 thisLocation=sample(la,n,replace=TRUE),
                 finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)

set.seed(4321)
dt_lookup <- data.table(lkpId = paste0("l-",seq(1,1000)),
                        lkpTime=sample(t, 10000, replace=TRUE),
                        lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposly recycled
setkey(dt_lookup, lkpLocation)

I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the minimum non-negative value of thisTime - lkpTime)

Function

## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum non-negative time between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){

  ## filter lookup based on thisLocation and finalLocation,
  ## and only return values where the lkpId has both 'this' and 'final' locations
  tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
  tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
  availServices <- tempThis[tempThis %in% tempFinal]

  tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]

  ## calcualte time difference between 'thisTime' and 'lkpTime' (from thisLocation)
  temp2 <- thisTime - tempThisFinal$lkpTime

  ## take the lkpId with the minimum non-negative difference
  selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
  selectedId
}

Attempts at a solution

I need to get the lkpId for each row of dt. Therefore, my initial instinct was to use an *apply function, but it was taking too long (for me) when n/nrow > 1,000,000. So I've tried to implement a data.table solution to see if it's faster:

selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]

However, I'm fairly new to data.table, and this method doesn't appear to give any performance gains over an *apply solution:

lkpIds <- apply(dt, 1, function(x){
  thisLocation <- as.character(x[["thisLocation"]])
  finalLocation <- as.character(x[["finalLocation"]])
  thisTime <- as.numeric(x[["thisTime"]])
  myId <- getId(thisTime, thisLocation, finalLocation)
})

both taking ~30 seconds for n = 10,000.

Question

Is there a better way of using data.table to apply the getId function over each row of dt ?

Update 12/08/2015

Thanks to the pointer from @eddi I've redesigned my whole algorithm and am making use of rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.

解决方案

Having spent the time since asking this question looking into what data.table has to offer, researching data.table joins thanks to @eddi's pointer (for example Rolling join on data.table, and inner join with inequality), I've come up with a solution.

One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.

And, there will no doubt be better ways of programming this, but here's my attempt.

## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'

## find all lookup id's where 'thisLocation' matches 'lookupLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)

dt_this <- dt[dt_lookup, {
  idx = thisTime - i.lkpTime > 0
  .(id = id[idx],
    lkpId = i.lkpId,
    thisTime = thisTime[idx],
    lkpTime = i.lkpTime)
},
by=.EACHI]

## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]

## find all matching 'finalLocation' and 'lookupLocaiton'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]

## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)

dt_join <- dt_this[dt_final, nomatch=0]

## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]

dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]  

## equivalent dplyr code
# library(dplyr)
# dt_this <- dt_this %>%
#   group_by(id) %>%
#   arrange(timeDiff) %>%
#   slice(1) %>%
#   ungroup 

这篇关于r - 对数据表的每一行应用函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆