r - 对数据表的每一行应用函数 [英] r - apply function to each row of a data.table
问题描述
我想使用 data.table
来提高给定函数的速度,但我不确定我是否以正确的方式实现它:
数据
给定两个 data.table
s( dt
和 dt_lookup
)
library(data.table)
set.seed(1234)
t< - seq(1,100); l - 字母; la < - letters [1:13]; lb < - letters [14:26]
n < - 10000
dt < - data.table(id = seq(1:n),
thisTime = sample ,replace = TRUE),
thisLocation = sample(la,n,replace = TRUE),
finalLocation = sample(lb,n,replace = TRUE)
set.seed(4321)
dt_lookup < - data.table(lkpId = paste0(l - ,seq(1,1000)),
lkpTime = sample (t,10000,replace = TRUE),
lkpLocation = sample(l,10000,replace = TRUE))
##注意:lkpId是循环使用
setkey(dt_lookup,lkpLocation)
我有一个函数可以找到 lkpId
包含 thisLocation
和 finalLocation
,并且具有'nearest' lkpTime
(即 thisTime - lkpTime
的最小非负值)
>
##函数获取'next'lkpId带有thisLocation和finalLocation的lkpId,
##和thisTime和dt_lookup之间的最小非负时间$ lkpTime)
getId < - function(thisTime,thisLocation,finalLocation){
##基于thisLocation和finalLocation,
##的过滤器查找,并且只返回lkpId具有'this'和'final'位置的值
tempThis< - unique(dt_lookup [lkpLocation = = thisLocation,lkpId])
tempFinal < - unique(dt_lookup [lkpLocation == finalLocation,lkpId])
availServices < - tempThis [tempThis%in%tempFinal]
tempThisFinal< - dt_lookup [lkpId%in%availServices& lkpLocation == thisLocation,。(lkpId,lkpTime)]
## calcualte'thisTime'和'lkpTime'之间的时间差(来自thisLocation)
temp2 < - thisTime - tempThisFinal $ lkpTime
##取具有最小非负差分的lkpId
selectedId< - tempThisFinal [min(which(temp2 == min(temp2 [temp2> 0]))),lkpId ]
selectedId
}
尝试解决方案
我需要为 dt $ c的每一行获取
lkpId
$ c>。因此,我的初始本能是使用 * apply
函数,但是对于 n / nrow> 1,000,000
。所以我试图实现一个 data.table
解决方案,看看它是否更快:
selectedId < - dt [,。(lkpId = getId(thisTime,thisLocation,finalLocation)),by = id]
但是,我是 data.table
的新用户,并且这个方法看起来不会比 * apply
解决方案:
lkpIds< - apply (x){
thisLocation< - as.character(x [[ThisLocation]])
finalLocation < - as.character(x [[finalLocation]])
thisTime < - as.numeric(x [[thisTime]])
myId < - getId(thisTime,thisLocation,finalLocation)
})
两者花费约30秒(n = 10,000)。
问题
有更好的方法使用 data.table
在 dt
的每一行上应用 getId
p>
更新12/08/2015
我重新设计了我的整个算法,并利用滚动连接(一个好的介绍) ),从而正确使用 data.table
。
自从提出这个问题以来,我们一直在研究 data.table
必须提供,研究 data.table
连接感谢@ eddi的指针(例如在数据表上滚动连接,以及 inner join with inequality ),我想出了一个解决方案。
一个棘手的部分是从'apply每行的函数,并重新设计解决方案以使用连接。
并且,毫无疑问,这将是更好的编程方式,但这里是我的尝试。
##想要为每个id找到一个lkpId,它在'thisTime'和'lkpTime'之间的最小差异
##,其中lkpId包含'thisLocation'和'finalLocation'
##找到所有查找id的'thisLocation'匹配'lookupLocation'
##和whereTime - lkpTime> 0
setkey(dt,thisLocation)
setkey(dt_lookup,lkpLocation)
dt_this < - dt [dt_lookup,{
idx = thisTime - i.lkpTime> ; 0
。(id = id [idx],
lkpId = i.lkpId,
thisTime = thisTime [idx],
lkpTime = i.lkpTime)
} ,
by = .EACHI]
## remove NAs
dt_this < - dt_this [complete.cases(dt_this)]
## find所有匹配的'finalLocation'和'lookupLocaiton'
setkey(dt,finalLocation)
##内部连接(并且只返回id列)
dt_final < - dt [dt_lookup,nomatch = 0 ,allow.cartesian = TRUE] [,。(id,lkpId)]
##将dt_this连接到dt_final(因为lkpId必须同时具有'thisLocation'和'finalLocation')
setkey dt_this,id,lkpId)
setkey(dt_final,id,lkpId)
dt_join < - dt_this [dt_final,nomatch = 0]
##组合'thisTime'和'lkpTime'之间的最小差异
dt_join [,timeDiff:= thisTime-lkpTime]
dt_join
##等效的dplyr代码
#library(dplyr)
#dt_this< - dt_this%>%
#group_by(id)%>%
#arrange(timeDiff)%>%
#slice(1)%>%
#ungroup
I'm looking to use data.table
to improve speed for a given function, but I'm not sure I'm implementing it the correct way:
Data
Given two data.table
s (dt
and dt_lookup
)
library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1:n),
thisTime=sample(t, n, replace=TRUE),
thisLocation=sample(la,n,replace=TRUE),
finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)
set.seed(4321)
dt_lookup <- data.table(lkpId = paste0("l-",seq(1,1000)),
lkpTime=sample(t, 10000, replace=TRUE),
lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposly recycled
setkey(dt_lookup, lkpLocation)
I have a function that finds the lkpId
that contains both thisLocation
and finalLocation
, and has the 'nearest' lkpTime
(i.e. the minimum non-negative value of thisTime - lkpTime
)
Function
## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum non-negative time between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){
## filter lookup based on thisLocation and finalLocation,
## and only return values where the lkpId has both 'this' and 'final' locations
tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
availServices <- tempThis[tempThis %in% tempFinal]
tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]
## calcualte time difference between 'thisTime' and 'lkpTime' (from thisLocation)
temp2 <- thisTime - tempThisFinal$lkpTime
## take the lkpId with the minimum non-negative difference
selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
selectedId
}
Attempts at a solution
I need to get the lkpId
for each row of dt
. Therefore, my initial instinct was to use an *apply
function, but it was taking too long (for me) when n/nrow > 1,000,000
. So I've tried to implement a data.table
solution to see if it's faster:
selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]
However, I'm fairly new to data.table
, and this method doesn't appear to give any performance gains over an *apply
solution:
lkpIds <- apply(dt, 1, function(x){
thisLocation <- as.character(x[["thisLocation"]])
finalLocation <- as.character(x[["finalLocation"]])
thisTime <- as.numeric(x[["thisTime"]])
myId <- getId(thisTime, thisLocation, finalLocation)
})
both taking ~30 seconds for n = 10,000.
Question
Is there a better way of using data.table
to apply the getId
function over each row of dt
?
Update 12/08/2015
Thanks to the pointer from @eddi I've redesigned my whole algorithm and am making use of rolling joins (a good introduction), thus making proper use of data.table
. I'll write up an answer later.
Having spent the time since asking this question looking into what data.table
has to offer, researching data.table
joins thanks to @eddi's pointer (for example Rolling join on data.table, and inner join with inequality), I've come up with a solution.
One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.
And, there will no doubt be better ways of programming this, but here's my attempt.
## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'
## find all lookup id's where 'thisLocation' matches 'lookupLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)
dt_this <- dt[dt_lookup, {
idx = thisTime - i.lkpTime > 0
.(id = id[idx],
lkpId = i.lkpId,
thisTime = thisTime[idx],
lkpTime = i.lkpTime)
},
by=.EACHI]
## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]
## find all matching 'finalLocation' and 'lookupLocaiton'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]
## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)
dt_join <- dt_this[dt_final, nomatch=0]
## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]
dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]
## equivalent dplyr code
# library(dplyr)
# dt_this <- dt_this %>%
# group_by(id) %>%
# arrange(timeDiff) %>%
# slice(1) %>%
# ungroup
这篇关于r - 对数据表的每一行应用函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!