Find the minimum distance between two data frames, for each element in the second data frame


Problem description

I have two data frames, ev1 and ev2, describing timestamps of two types of events collected over many tests. Each data frame has a "test_id" column and a timestamp column (named "time" in the examples here). For each ev2 event, I need to find the minimum distance to an ev1 event in the same test.

I have working code that merges the two datasets, calculates the distances, and then uses dplyr to filter for the minimum distance:

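library(dplyr)

ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(6, 1, 8, 4, 5, 11))

# Pair every ev2 row with every ev1 row that has the same test_id
data <- merge(ev2, ev1, by=c("test_id"), suffixes=c(".ev2", ".ev1"))

# Signed distance from each ev1 event to the ev2 event
data$distance <- data$time.ev2 - data$time.ev1

# For each ev2 event, keep the row(s) with the smallest absolute distance
min_data <- data %>%
  group_by(test_id, time.ev2) %>%
  filter(abs(distance) == min(abs(distance)))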

While this works, the merge part is very slow and feels inefficient -- I'm generating a huge table with all combinations of ev2->ev1 for the same test_id, only to filter it down to one. It seems like there should be a way to "filter on the fly", during the merge. Is there?

Update: The following case with two "group by" columns fails when the data.table approach outlined by akrun is used:

ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4), group_id=c(0, 0, 0, 1, 1, 1))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(5, 6, 7, 1, 2, 8), group_id=c(0, 0, 0, 1, 1, 1))
setkey(setDT(ev1), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=abs(time-i.time)]

Error in eval(expr, envir, enclos) : object 'i.time' not found

Solution

Here's how I'd do it using data.table:

require(data.table)
setkey(setDT(ev1), test_id)
ev1[ev2, .(ev2.time=i.time, ev1.time=time[which.min(abs(i.time-time))]), by=.EACHI]
#    test_id ev2.time ev1.time
# 1:       0        6        3
# 2:       0        1        1
# 3:       0        8        3
# 4:       1        4        4
# 5:       1        5        4
# 6:       1       11        4

In joins of the form x[i] in data.table, the prefix i. is used to refer to the columns in i when both x and i share the same name for a particular column.
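As a rough sketch of the i. prefix applied to the two-column case from the question's update: the version below names the join columns explicitly with on= instead of relying on a key. This assumes data.table 1.9.6 or later, where on= is available. It is not a diagnosis of the error in the update, only one way to make sure the join uses test_id and group_id. The variable name res is just for illustration.

```r
library(data.table)

# Data from the question's update: two grouping columns, test_id and group_id.
ev1 <- data.table(test_id  = c(0, 0, 0, 1, 1, 1),
                  time     = c(1, 2, 3, 2, 3, 4),
                  group_id = c(0, 0, 0, 1, 1, 1))
ev2 <- data.table(test_id  = c(0, 0, 0, 1, 1, 1),
                  time     = c(5, 6, 7, 1, 2, 8),
                  group_id = c(0, 0, 0, 1, 1, 1))

# Join on both grouping columns. Inside j, `time` is ev1's column (x)
# and `i.time` is ev2's column (i), because both tables have a `time` column.
res <- ev1[ev2,
           .(ev2.time = i.time,
             ev1.time = time[which.min(abs(i.time - time))]),
           on = .(test_id, group_id),
           by = .EACHI]
print(res)
```

Because the join columns are named in on=, the position of the time column within ev2 plays no part in how rows are matched.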

Please see this SO post for an explanation on how this works.

The syntax makes it more straightforward to see what's going on, and it is memory efficient (at the expense of a little speed [1]) because it doesn't materialise the entire join result at all. In fact, this does exactly what you ask for in your post: filter on the fly, while merging.

  [1] On speed: in most cases it doesn't really matter. If there are A LOT of rows in i, it might be a tad slower, as the j-expression has to be evaluated for each row in i. In contrast, @akrun's answer does a cartesian join followed by a single filtering step. So while it's high on memory, it doesn't evaluate j for each row in i. But again, this shouldn't even matter unless you work with a really large i, which is not often the case.
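To make the trade-off in the footnote concrete, here is a small sketch that runs both styles on the question's data. The second variant is a generic join-then-reduce written for illustration, not necessarily @akrun's exact code, and the names on_the_fly, all_pairs and filtered are made up for this example. It again assumes a data.table version that supports on=.

```r
library(data.table)

ev1 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(1, 2, 3, 2, 3, 4))
ev2 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(6, 1, 8, 4, 5, 11))

# Style A: by = .EACHI evaluates j once per row of ev2 and never
# builds the full table of ev1/ev2 pairs.
on_the_fly <- ev1[ev2,
                  .(ev2.time = i.time,
                    ev1.time = time[which.min(abs(i.time - time))]),
                  on = .(test_id), by = .EACHI]

# Style B: materialise every ev1/ev2 pair within a test_id first
# (18 rows here, hence allow.cartesian = TRUE), then reduce per ev2 event.
# Note: grouping by (test_id, ev2 time) would merge duplicate ev2 times.
all_pairs <- ev1[ev2, on = .(test_id), allow.cartesian = TRUE]
filtered  <- all_pairs[, .(ev1.time = time[which.min(abs(i.time - time))]),
                       by = .(test_id, ev2.time = i.time)]

print(nrow(all_pairs))
print(on_the_fly)
print(filtered)
```

On this toy input both styles return the same six rows; the difference is only that style B holds the 18-row intermediate table in memory, which is the cost the footnote refers to.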

HTH
