r - Finding closest coordinates between two large data sets


Problem description


I am aiming to identify, for each entry in dataset 1, the nearest entry in dataset 2, based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains 4,500 rows (all 4,500 coordinates unique).

I have attempted to replicate the answers to similar questions on Stack Overflow, for example:

R - Finding closest neighboring point and number of neighbors within a given radius, coordinates lat-long

Calculating the distance between points in different data frames

However these do not solve the problem in the way I want (they either join the data frames or check the distances within a single dataframe).

The solution in "Find the nearest X,Y coordinate using R" and related posts is the closest I have found so far.

My issue with the post is that it works out the distance between coordinates within a single dataframe, and I have been unable to understand which parameters to change in RANN::nn2 to do it across two data frames.

Proposed code that doesn't work:

library(RANN)
dataset1[,4]<- nn2(data=dataset1, query=dataset2, k=2)

Notes/Questions:

1) Which dataset should be provided to the query to find the closest value in dataset 2 to a given value in dataset 1?

2) Is there any way to avoid the problem that the datasets seem to need to be the same width (number of columns)?

3) How can the outputs (SRC_ID and distance) be added to the relevant entry in dataset 1?

4) What is the use of eps parameter in the RANN::nn2 function?

The aim is to populate the SRC_ID and distance columns in dataset 1 with the nearest station ID from dataset 2 and the distance between the entry in dataset 1 and the nearest entry in dataset 2.

Below is a table demonstrating the expected results. Note: the SRC_ID and distance values are example values I added manually; they are almost certainly incorrect and will likely not be reproduced by the code.

       id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
1 3797987      52.88121     -2.873734     55      350 
2 3798045      53.80945     -2.439163     76     2100

Data:

R details

platform        x86_64-w64-mingw32
version.string  R version 3.5.3 (2019-03-11)

data set 1 input (not narrowed down to unique coordinates)

structure(list(id = c(1L, 2L, 4L, 5L, 
6L, 7L, 8L, 9, 10L, 3L), 
    HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529, 
    63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207, 
    78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822, 
    -2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454, 
    -5.162345043546359, -8.63311254098095, 3.813289888829932, 
    -3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")

data set 2 input

structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L, 
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062, 
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA, 
10L), class = "data.frame")

Solution

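I wrote up an answer referencing this thread. The function is modified to report the distance as well and to avoid hard-coding column names. Please note that it calculates Euclidean distance on the raw latitude/longitude values, so the result is in degrees rather than metres, and that Closest.V1 is a row number in the second data frame, not a SRC_ID.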

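library(data.table)

# Euclidean distance from the point (a, b) to every row of df1;
# x and y are the names of the coordinate columns in df1.
mydist <- function(a, b, df1, x, y) {
  dt <- data.table(sqrt((df1[[x]] - a)^2 + (df1[[y]] - b)^2))

  # Closest.V1 is the row index of the nearest point, Distance its distance.
  return(data.table(Closest.V1 = which.min(dt$V1),
                    Distance   = dt[which.min(dt$V1)]))
}

# For each row of df1 (grouped by id and coordinates), search all rows of df2.
setDT(df1)[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON, setDT(df2),
                        "HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
           by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]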

  #     id HIGH_PRCN_LAT HIGH_PRCN_LON Closest.V1 Distance.V1
  # 1:   1      52.88144     -2.873778          5   0.7990743
  # 2:   2      57.80945     -2.234544          8   2.1676868
  # 3:   4      34.02335     -3.098445         10   1.4758202
  # 4:   5      63.80879     -2.439163          3   4.2415854
  # 5:   6      53.68881     -7.396112          2   3.6445416
  # 6:   7      63.44628     -5.162345          3   2.3577811
  # 7:   8      21.60755     -8.633113          9   8.2123762
  # 8:   9      78.32444      3.813290          7  11.4936496
  # 9:  10      66.85533     -3.994326          1   1.9296370
  # 10:  3      51.62354     -8.906553          2   3.2180026
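Since the function above measures distance in degrees, the nearest point it reports is not necessarily the nearest one on the ground, and the question's expected table suggests distances in metres. Below is a minimal base-R sketch of the same brute-force search using the haversine (great-circle) formula. The function names haversine_m and nearest_haversine are made up for illustration, and the spherical Earth radius of 6,371 km is an approximation.

```r
# Great-circle (haversine) distance in metres between two lat/lon points.
# Assumes a spherical Earth with mean radius r (an approximation).
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1.
  2 * r * asin(pmin(1, sqrt(a)))
}

# For each row of `query`, find the nearest row of `ref` by brute force.
# Returns the row index in `ref` and the distance in metres.
nearest_haversine <- function(query, ref,
                              lat = "HIGH_PRCN_LAT", lon = "HIGH_PRCN_LON") {
  res <- lapply(seq_len(nrow(query)), function(i) {
    d <- haversine_m(query[[lat]][i], query[[lon]][i], ref[[lat]], ref[[lon]])
    j <- which.min(d)
    data.frame(idx = j, distance_m = d[j])
  })
  do.call(rbind, res)
}
```

With the question's data this would be called as nearest_haversine(df1, df2); the idx column can then be used to look up df2$SRC_ID. Brute force is quadratic, so for the full 180,000 rows it is probably worth running it on the 1,800 unique coordinates only.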


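You can use RANN::nn2, but you need to use the right syntax: pass dataset 2 as the first argument (data) and dataset 1 as the second (query), each restricted to its two coordinate columns. The following works: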

as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))

#    nn.idx   nn.dists
# 1       5  0.7990743
# 2       8  2.1676868
# 3      10  1.4758202
# 4       3  4.2415854
# 5       2  3.6445416
# 6       3  2.3577811
# 7       9  8.2123762
# 8       7 11.4936496
# 9       1  1.9296370
# 10      2  3.2180026
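The nn.idx values above are row numbers in df2, not station IDs. Below is a minimal sketch of how the result might be written back into dataset 1, touching on the numbered questions: dataset 2 goes in data and dataset 1 in query, and selecting only the coordinate columns avoids the column-count mismatch. Here df1 and df2 are small stand-ins built from the sample data (the first three rows of dataset 1), and coords is a name introduced for illustration. As for eps, the RANN documentation describes it as an error bound, with the default of 0 giving an exact search.

```r
library(RANN)

# Small stand-ins built from the sample data in the question.
df1 <- data.frame(
  id = c(1L, 2L, 4L),
  HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529),
  HIGH_PRCN_LON = c(-2.87377812157822, -2.23454414781635, -3.0984448341)
)
df2 <- data.frame(
  SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L, 5688L, 440L, 61114L),
  HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 42.57807, 52.29879,
                    68.52132, 87.83912, 55.67825, 29.74444, 34.33228),
  HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
                    -6.99441, -2.63457, -2.63057, -7.52216, -1.65532)
)

# Only the coordinate columns go into the search, so the two data frames
# may have any number of other columns.
coords <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")

# data  = the points to search in (dataset 2)
# query = the points to find a neighbour for (dataset 1)
nn <- nn2(data = df2[, coords], query = df1[, coords], k = 1)

# nn.idx holds row numbers of df2; translate them into station IDs.
df1$SRC_ID   <- df2$SRC_ID[nn$nn.idx[, 1]]
df1$distance <- nn$nn.dists[, 1]  # Euclidean distance, in degrees

df1
```

Because dataset 1 has 180,000 rows but only 1,800 unique coordinates, it may be cheaper to run nn2 on unique(df1[, coords]) and merge the result back onto df1 by the coordinate columns.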
