Finding closest coordinates between two large data sets


Problem description

I am aiming to identify the nearest entry in dataset 2 to each entry in dataset 1, based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains 4,500 rows (all 4,500 with unique coordinates).

I have attempted to replicate the answers from similar questions on Stack Overflow. For example:

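R - Finding closest neighboring point and number of neighbors within a given radius, coordinates lat-long

Calculating the distance between points in different data frames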

However, these do not solve the problem in the way I want (they either join the data frames or check the distances within a single data frame).

The solution in "Find the nearest X,Y coordinate using R" and its related posts is the closest I have found so far.

My issue with that post is that it works out the distance between coordinates within a single data frame, and I have been unable to understand which parameters to change in RANN::nn2 to do it across two data frames.

Suggested code that does not work:

library(RANN)
dataset1[,4]<- nn2(data=dataset1, query=dataset2, k=2)

Notes/questions:

1) Which dataset should be passed as the query to find the closest value in dataset 2 to a given value in dataset 1?

2) Is there any way to avoid the problem that the datasets seem to need to be the same width (number of columns)?

3) How can the outputs (SRC_ID and distance) be added to the relevant entry in dataset 1?

4) What is the use of the eps parameter in the RANN::nn2 function?

The aim is to populate the SRC_ID and distance columns in dataset 1 with the nearest station ID from dataset 2 and the distance between the entry in dataset 1 and the nearest entry in dataset 2.

Below is a table demonstrating the expected results. Note: the SRC_ID and distance values are example values I have added manually; they are almost certainly incorrect and will likely not be reproduced by the code.

       id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
1 3797987      52.88121     -2.873734     55      350 
2 3798045      53.80945     -2.439163     76     2100

Data:

R details

platform        x86_64-w64-mingw32
version.string  R version 3.5.3 (2019-03-11)

Dataset 1 input (not narrowed down to unique coordinates)

structure(list(id = c(1L, 2L, 4L, 5L, 
6L, 7L, 8L, 9, 10L, 3L), 
    HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529, 
    63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207, 
    78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822, 
    -2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454, 
    -5.162345043546359, -8.63311254098095, 3.813289888829932, 
    -3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")

Dataset 2 input

structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L, 
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062, 
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA, 
10L), class = "data.frame")

Recommended answer

I wrote up an answer referencing this thread. The function is modified to report the distance as well and to avoid hard-coding column names. Please note that it calculates Euclidean distance.
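Because the inputs are latitude/longitude pairs, a Euclidean distance is measured in degrees rather than metres or kilometres. If a physical distance is needed, one option is the haversine formula. The following is only a sketch in base R: `haversine_km` is a hypothetical helper that is not part of the original answer, and the spherical Earth radius of 6371 km is an assumption. It is shown on one point from dataset 1 and two stations from dataset 2.

```r
# Hypothetical helper (not from the original answer): great-circle distance in
# kilometres via the haversine formula, assuming a spherical Earth (6371 km).
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  h <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(h) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(h)))
}

# Two stations taken from dataset 2 in the question
stations <- data.frame(SRC_ID = c(44L, 5688L),
                       HIGH_PRCN_LAT = c(52.29879, 55.67825),
                       HIGH_PRCN_LON = c(-3.42062, -2.63057))

# Distance from the first point of dataset 1 to every station (vectorised)
d <- haversine_km(52.881442267773, -2.87377812157822,
                  stations$HIGH_PRCN_LAT, stations$HIGH_PRCN_LON)

nearest <- which.min(d)
stations$SRC_ID[nearest]  # ID of the closest station
d[nearest]                # its distance in kilometres
```

The nearest neighbour under a great-circle distance need not be the same as under a Euclidean distance in degrees, especially at high latitudes, so it is worth checking which one the analysis actually needs.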

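library(data.table)

# Euclidean distance (in degrees) from the point (a, b) to every row of the
# lookup table. Note that the argument named df1 receives the lookup table,
# which is dataset 2 (df2) in the call below; x and y name its coordinate columns.
mydist <- function(a, b, df1, x, y){
          dt <- data.table(sqrt((df1[[x]]-a)^2 + (df1[[y]]-b)^2))
          # Closest.V1 is the row index of the nearest point in the lookup table;
          # Distance is returned as Distance.V1 because dt has one column, V1.
          return(data.table(Closest.V1  = which.min(dt$V1),
                            Distance    = dt[which.min(dt$V1)]))
           }

# For each row of dataset 1 (df1), find the nearest row of dataset 2 (df2).
setDT(df1)[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON, setDT(df2), 
                        "HIGH_PRCN_LAT", "HIGH_PRCN_LON"), 
                         by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]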

  #     id HIGH_PRCN_LAT HIGH_PRCN_LON Closest.V1 Distance.V1
  # 1:   1      52.88144     -2.873778          5   0.7990743
  # 2:   2      57.80945     -2.234544          8   2.1676868
  # 3:   4      34.02335     -3.098445         10   1.4758202
  # 4:   5      63.80879     -2.439163          3   4.2415854
  # 5:   6      53.68881     -7.396112          2   3.6445416
  # 6:   7      63.44628     -5.162345          3   2.3577811
  # 7:   8      21.60755     -8.633113          9   8.2123762
  # 8:   9      78.32444      3.813290          7  11.4936496
  # 9:  10      66.85533     -3.994326          1   1.9296370
  # 10:  3      51.62354     -8.906553          2   3.2180026

You can also use RANN::nn2, but you need to use the right syntax: pass dataset 2 as data and dataset 1 as query, selecting only the coordinate columns of each. The following works:

as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))

#    nn.idx   nn.dists
# 1       5  0.7990743
# 2       8  2.1676868
# 3      10  1.4758202
# 4       3  4.2415854
# 5       2  3.6445416
# 6       3  2.3577811
# 7       9  8.2123762
# 8       7 11.4936496
# 9       1  1.9296370
# 10      2  3.2180026
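The nn.idx values above are row positions in df2, not station IDs, so one more step is needed to fill SRC_ID and distance in dataset 1. Below is a minimal sketch, assuming the RANN package is installed. It uses dataset 2 and the first three rows of dataset 1 from the question; the names `cols` and `nn` are illustrative only.

```r
library(RANN)

# Dataset 2 from the question (stations with known IDs)
df2 <- data.frame(
  SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L, 5688L, 440L, 61114L),
  HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 42.57807, 52.29879,
                    68.52132, 87.83912, 55.67825, 29.74444, 34.33228),
  HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
                    -6.99441, -2.63457, -2.63057, -7.52216, -1.65532))

# First three rows of dataset 1 from the question
df1 <- data.frame(
  id = c(1L, 2L, 4L),
  HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529),
  HIGH_PRCN_LON = c(-2.87377812157822, -2.23454414781635, -3.0984448341))

cols <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")

# data  = the set to search in (dataset 2)
# query = the points to look up (dataset 1)
# Selecting only the coordinate columns gives both inputs the same width.
nn <- nn2(data = df2[, cols], query = df1[, cols], k = 1)

# nn.idx holds row positions in df2, so translate them into station IDs
df1$SRC_ID   <- df2$SRC_ID[nn$nn.idx[, 1]]
df1$distance <- nn$nn.dists[, 1]  # Euclidean distance, in degrees
df1
```

This addresses questions 1 to 3. On question 4, the RANN documentation describes eps as an error bound, with the default of 0 meaning an exact nearest-neighbour search. Since dataset 1 has only 1,800 unique coordinates, it may also be worth running the query on `unique(df1[, cols])` and merging the result back, although with data of this size the direct call is likely fast enough.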
