Find the minimum distance between two data frames, for each element in the second data frame
I have two data frames, ev1 and ev2, describing timestamps of two types of events collected over many tests. Each data frame has a "test_id" column and a timestamp column (called "time" in the example below). For each ev2 event, I need to find the ev1 event at the minimum distance within the same test.
I have a working code that merges the two datasets, calculates the distances, and then uses dplyr to filter for the minimum distance:
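library(dplyr)  # provides %>%, group_by() and filter()

ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(6, 1, 8, 4, 5, 11))
# Pair every ev2 row with every ev1 row that shares its test_id
data <- merge(ev2, ev1, by=c("test_id"), suffixes=c(".ev2", ".ev1"))
data$distance <- data$time.ev2 - data$time.ev1
# For each ev2 event, keep the ev1 row(s) at the smallest absolute distance
min_data <- data %>%
  group_by(test_id, time.ev2) %>%
  filter(abs(distance) == min(abs(distance)))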
While this works, the merge part is very slow and feels inefficient -- I'm generating a huge table with all combinations of ev2->ev1 for the same test_id, only to filter it down to one. It seems like there should be a way to "filter on the fly", during the merge. Is there?
Update: The following case, with two "group by" columns, fails when the data.table approach outlined by akrun is used:
ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4), group_id=c(0, 0, 0, 1, 1, 1))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(5, 6, 7, 1, 2, 8), group_id=c(0, 0, 0, 1, 1, 1))
setkey(setDT(ev1), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=abs(time-i.time)]
Error in eval(expr, envir, enclos) : object 'i.time' not found
Here's how I'd do it using data.table:
require(data.table)
setkey(setDT(ev1), test_id)
ev1[ev2, .(ev2.time=i.time, ev1.time=time[which.min(abs(i.time-time))]), by=.EACHI]
#    test_id ev2.time ev1.time
# 1:       0        6        3
# 2:       0        1        1
# 3:       0        8        3
# 4:       1        4        4
# 5:       1        5        4
# 6:       1       11        4
In joins of the form x[i] in data.table, the prefix i. (as in i.time) is used to refer to the columns in i when both x and i share the same name for a particular column.
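The same i. prefix can be used when the join involves more than one column, as in the update to the question. The following is only a sketch under a few assumptions: it builds the tables with data.table() directly and names the join columns explicitly with on= instead of setkey(). As far as I understand, a keyed join without on= matches the key columns of x to the columns of i by position, so naming the columns avoids depending on the column order of ev2. The name nearest is just an illustrative choice.

```r
library(data.table)

# Data from the update to the question (two grouping columns).
ev1 <- data.table(test_id  = c(0, 0, 0, 1, 1, 1),
                  time     = c(1, 2, 3, 2, 3, 4),
                  group_id = c(0, 0, 0, 1, 1, 1))
ev2 <- data.table(test_id  = c(0, 0, 0, 1, 1, 1),
                  time     = c(5, 6, 7, 1, 2, 8),
                  group_id = c(0, 0, 0, 1, 1, 1))

# Join on both columns by name. With by = .EACHI, j is evaluated once per
# row of ev2: `time` holds the matching ev1 times, `i.time` the ev2 time.
nearest <- ev1[ev2,
               .(ev2.time = i.time,
                 ev1.time = time[which.min(abs(i.time - time))]),
               on = c("test_id", "group_id"),
               by = .EACHI]
print(nearest)
```

Note that which.min() returns only the first minimum, so if two ev1 times are equally close, only one of them is kept.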
Please see this SO post for an explanation on how this works.
This is syntactically more straightforward to follow, and it is memory efficient (at the expense of a little speed [1]) because it doesn't materialise the entire join result at all. In fact, it does exactly what you ask for in your post: it filters on the fly, while merging.
[1] On speed: in most cases it doesn't really matter. If there are a lot of rows in i, it might be a tad slower, as the j-expression has to be evaluated for each row in i. In contrast, @akrun's answer does a cartesian join followed by a single filtering step, so while it is high on memory, it doesn't evaluate j for each row in i. But again, this shouldn't matter unless you work with a really large i, which is not often the case.
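To make the trade-off concrete, here is a small sketch that runs both strategies on the example data and checks that they agree. It is not @akrun's actual code: the "materialise, then filter" variant below is my own stand-in, built with merge() as in the question, and the names on_the_fly, all_pairs and filtered are illustrative.

```r
library(data.table)

ev1 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(1, 2, 3, 2, 3, 4))
ev2 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(6, 1, 8, 4, 5, 11))
setkey(ev1, test_id)

# Strategy A: j is evaluated once per row of ev2; the full join is never stored.
on_the_fly <- ev1[ev2,
                  .(ev2.time = i.time,
                    ev1.time = time[which.min(abs(i.time - time))]),
                  by = .EACHI]

# Strategy B: store every ev2/ev1 pair within a test, then filter per group.
all_pairs <- merge(ev2, ev1, by = "test_id",
                   suffixes = c(".ev2", ".ev1"), allow.cartesian = TRUE)
filtered <- all_pairs[,
                      .(ev1.time = time.ev1[which.min(abs(time.ev2 - time.ev1))]),
                      by = .(test_id, ev2.time = time.ev2)]

# Put both results in the same row order before comparing them.
setorder(on_the_fly, test_id, ev2.time)
setorder(filtered, test_id, ev2.time)
print(all.equal(on_the_fly$ev1.time, filtered$ev1.time))
```

Here all_pairs holds 18 rows for 6 ev2 events, which is the memory cost that the by = .EACHI version avoids.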
HTH