R - Assign column value based on closest match in second data frame


Problem description

I have two data frames, logger and df (times are numeric):

logger <- data.frame(
time = c(1280248354:1280248413),
temp = runif(60,min=18,max=24.5)
)

df <- data.frame(
obs = c(1:10),
time = runif(10,min=1280248354,max=1280248413),
temp = NA
)

I would like to search logger$time for the closest match to each row in df$time, and assign the associated logger$temp to df$temp. So far, I have been successful using the following loop:

# For each observation, find the logger reading nearest in time
for (i in seq_along(df$time)) {
  closest <- which.min(abs(logger$time - df$time[i]))
  df$temp[i] <- logger$temp[closest]
}
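To make explicit what each iteration computes, here is a minimal sketch of the same logic on toy data. The helper name `nearest_value()` and the toy vectors are hypothetical and not part of the original post; the query times are chosen so that none sits exactly halfway between two reference times.

```r
# Hypothetical helper (not from the original post): for each query time,
# return the value whose reference time is nearest.
nearest_value <- function(query_time, ref_time, ref_value) {
  vapply(
    query_time,
    function(t) ref_value[which.min(abs(ref_time - t))],
    numeric(1)
  )
}

# Toy data standing in for logger$time, logger$temp and df$time
ref_time  <- c(10, 20, 30)
ref_value <- c(1.5, 2.5, 3.5)
query     <- c(12, 26, 30)

# 12 is nearest to 10, 26 is nearest to 30, 30 matches 30 exactly
result <- nearest_value(query, ref_time, ref_value)
print(result)
```

This still scans every reference time once per query, so it has the same cost profile as the loop; it only isolates the matching step.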

However, I now have large data frames (logger has 13,620 rows and df has 266,138 rows) and processing times are long. I've read that loops are not the most efficient way to do this, but I am unfamiliar with the alternatives. Is there a faster way?

Solution

I'd use data.table for this. It makes joining on keys very easy and very fast. There is even a helpful roll = "nearest" argument that gives exactly the behaviour you are looking for (although in your example data it is not necessary, because all times from df appear in logger). In the following example I renamed df$time to df$time1 to make it clear which column belongs to which table:

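#  Load package
library( data.table )

#  Rename df$time to df$time1 so it is clear which column belongs to which table
names( df )[ names( df ) == "time" ] <- "time1"

#  Make data.frames into data.tables with a key column
ldt <- data.table( logger , key = "time" )
dt <- data.table( df , key = "time1" )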

#  Join based on the key column of the two tables (time & time1)
#  roll = "nearest" gives the desired behaviour
#  list( obs , time1 , temp ) gives the columns you want to return (temp holds the logger values)
ldt[ dt , list( obs , time1 , temp ) , roll = "nearest" ]
#          time obs      time1     temp
# 1: 1280248361   8 1280248361 18.07644
# 2: 1280248366   4 1280248366 21.88957
# 3: 1280248370   3 1280248370 19.09015
# 4: 1280248376   5 1280248376 22.39770
# 5: 1280248381   6 1280248381 24.12758
# 6: 1280248383  10 1280248383 22.70919
# 7: 1280248385   1 1280248385 18.78183
# 8: 1280248389   2 1280248389 18.17874
# 9: 1280248393   9 1280248393 18.03098
#10: 1280248403   7 1280248403 22.74372
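The example above returns the joined table, whereas the question asks for the matched temperature to be written back to the observations. A minimal sketch of that step follows, on toy data where the observation times fall between logger times. The table names `logger_toy` and `obs_toy` are hypothetical, and the sketch assumes a data.table version that supports the `on` argument, so no key needs to be set.

```r
library(data.table)

# Toy data (illustrative only): logger readings, and observations whose
# times do not all coincide with a logger time
logger_toy <- data.table(time = c(10, 20, 30), temp = c(1.5, 2.5, 3.5))
obs_toy    <- data.table(obs = 1:3, time1 = c(12, 26, 30))

# Rolling join: for each row of obs_toy, pick the logger_toy row whose
# time is nearest to time1. Rows come back in the order of obs_toy.
matched <- logger_toy[obs_toy, on = .(time = time1), roll = "nearest"]

# Write the matched temperatures back onto the observations
obs_toy[, temp := matched$temp]
print(obs_toy)
```

Here 12 is matched to the reading at 10, 26 to the reading at 30, and 30 to itself. With real data it is worth checking how ties (a time exactly halfway between two readings) should be resolved before relying on the result.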
