Calculating distance between two rows in a data.table

Question

Summary of problem: I am cleaning up a fish telemetry dataset (i.e., spatial coordinates through time) using the data.table package (version 1.9.5) in R (version) on a Windows 7 PC. Some of the data points are wrong (e.g., the telemetry equipment picked up echoes). We can tell these points are wrong because the fish would have moved farther than is biologically possible, so they stand out as outliers. The actual dataset contains over 2,000,000 rows of data from 30 individual fish, hence the use of the data.table package.

I am removing points that are too far apart (i.e., distance traveled is greater than a maximum distance). However, I need to recalculate distance traveled between points after removing a point because 2-3 data points were sometimes misrecorded in clusters. Currently, I have a for loop that gets the job done, but is likely far from optimal and I know that I am likely missing some of the powerful tools in the data.table package.

As a technical note, my spatial scale is small enough that a Euclidean distance works, and my maximum distance criterion is biologically reasonable.
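For illustration, the Euclidean step distance between consecutive rows can be computed in one vectorized call rather than row by row. This is only a minimal sketch: it assumes a data.table version that provides `shift()`, reuses the column names from the example data further down, and treats the first row of each fish as having a distance of 0.

```r
library(data.table)

# Hypothetical example data, using the same column names as the MVCE
dt <- data.table(fish = 1,
                 time = 1:6,
                 easting = c(1, 2, 10, 11, 3, 4),
                 northing = c(1, 2, 10, 11, 3, 4))

# Euclidean distance from the previous row, computed within each fish.
# shift() lags a column by one row and fills the first value with NA.
dt[, dist := sqrt((easting - shift(easting))^2 +
                  (northing - shift(northing))^2), by = fish]

# The first row of each fish has no predecessor; set its distance to 0
dt[is.na(dist), dist := 0]
```

Grouping with `by = fish` keeps the last point of one fish from being compared with the first point of the next.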

Where I have looked for help: I have looked through SO and found several helpful answers, but none exactly matches my problem. Specifically, all of the other answers compare only one column of data among rows.


  1. This answer compares two rows using data.table, but only looks at one variable.
  2. This answer looks promising and uses Reduce, but I could not figure out how to use Reduce with two columns.
  3. This answer uses an indexing feature from data.table, but I could not figure out how to use it with a distance function.
  4. Last, this answer demonstrates the roll function of data.table. However, I could not figure out how to use two variables with this function either.

Here is my MVCE:

library(data.table)
## Create dummy data.table
dt <- data.table(fish = 1,
                 time = 1:6,
                 easting = c(1, 2, 10, 11, 3, 4),
                 northing = c(1, 2, 10, 11, 3, 4))
dt[ , dist := 0]

maxDist = 5

## First pass of calculating distances 
for(index in 2:dim(dt)[1]){
    dt[ index,
       dist := as.numeric(dist(dt[c(index -1, index),
                list(easting, northing)]))]
}

## Loop through and remove points until all of the outliers have been
## removed for the data.table. 
while(all(dt[ , dist < maxDist]) == FALSE){
    dt <- copy(dt[ - dt[ , min(which(dist > maxDist))], ])
    ## Loops through and recalculates distance after removing outlier  
    for(index in 2:dim(dt)[1]){
        dt[ index,
           dist := as.numeric(dist(dt[c(index -1, index),
                    list(easting, northing)]))]
    }
}


Recommended answer

I'm a little confused why you keep recomputing the distance (and needlessly copying data) instead of just doing a single pass:

last = 1
idx = rep(0, nrow(dt))
for (curr in 1:nrow(dt)) {
  if (dist(dt[c(curr, last), .(easting, northing)]) <= maxDist) {
    idx[curr] = curr
    last = curr
  }
}

dt[idx]
#   fish time easting northing
#1:    1    1       1        1
#2:    1    2       2        2
#3:    1    5       3        3
#4:    1    6       4        4
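Since the real dataset holds 30 fish, the same single pass could be run once per fish. The sketch below is one possible way to do that, not the answer's own code: `keep_within` is a hypothetical helper (not part of data.table), the two-fish table is made-up example data, and, like the loop above, it assumes the first row of each fish is a valid point. It works on plain vectors inside the loop, which avoids subsetting the data.table on every iteration.

```r
library(data.table)

# Hypothetical helper: returns TRUE for each row to keep. A row is kept
# when it lies within max_dist of the most recently kept row.
keep_within <- function(x, y, max_dist) {
  keep <- logical(length(x))
  last <- 1L
  for (curr in seq_along(x)) {
    step <- sqrt((x[curr] - x[last])^2 + (y[curr] - y[last])^2)
    if (step <= max_dist) {
      keep[curr] <- TRUE
      last <- curr
    }
  }
  keep
}

# Made-up data: two fish with the same track as the example above
dt <- data.table(fish = rep(1:2, each = 6),
                 time = rep(1:6, 2),
                 easting = rep(c(1, 2, 10, 11, 3, 4), 2),
                 northing = rep(c(1, 2, 10, 11, 3, 4), 2))
maxDist <- 5

# Flag rows separately for each fish, then keep only the flagged rows
dt[, keep := keep_within(easting, northing, maxDist), by = fish]
cleaned <- dt[keep == TRUE, .(fish, time, easting, northing)]
```

For each fish, `cleaned` retains the rows at times 1, 2, 5 and 6, matching the single-fish result above.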
