r - Finding closest coordinates between two large data sets
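Problem description
I have attempted to replicate the answers from similar questions on Stack Overflow, for example:
Calculating the distance between points in different data frames
R - Finding the closest neighboring point and the number of neighbors within a given radius, coordinates lat-long
However, these do not solve the problem in the way I want (they either join the data frames or check the distances within a single data frame).
The solution in "Find the nearest X,Y coordinate using R" and related posts is the closest I have found so far.
My issue with that post is that it works out the distance between coordinates within a single data frame, and I have been unable to work out which parameters to change in RANN::nn2 to do the same across two data frames.
Proposed code that doesn't work: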
library(RANN)
dataset1[,4]<- nn2(data=dataset1, query=dataset2, k=2)
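Notes/Questions:
1) Which dataset should be supplied as the query to find the closest value in dataset 2 to a given value in dataset 1?
2) Is there any way to avoid the problem that the datasets seem to need the same width (number of columns)?
3) How can the outputs (SRC_ID and distance) be added to the relevant entry in dataset 1?
4) What is the purpose of the eps parameter in the RANN::nn2 function?
The aim is to populate the SRC_ID and distance columns in dataset 1 with the nearest station ID from dataset 2 and the distance between the entry in dataset 1 and the nearest entry in dataset 2.
Below is a table demonstrating the expected results. Note: the SRC_ID and distance values are example values I added manually; they are almost certainly incorrect and will likely not be reproduced by the code.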
       id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
1 3797987      52.88121     -2.873734     55      350
2 3798045      53.80945     -2.439163     76     2100
Data:
R details
platform       x86_64-w64-mingw32
version.string R version 3.5.3 (2019-03-11)
Dataset 1 input (not narrowed down to unique coordinates)
structure(list(id = c(1L, 2L, 4L, 5L,
6L, 7L, 8L, 9, 10L, 3L),
HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529,
63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207,
78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822,
-2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454,
-5.162345043546359, -8.63311254098095, 3.813289888829932,
-3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")
Dataset 2 input
structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L,
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432,
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA,
10L), class = "data.frame")
I am aiming to identify the nearest entry in dataset 2 to each entry in dataset 1 based on the coordinates in both datasets. The full dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and the full dataset 2 contains 4,500 rows (all 4,500 coordinates unique).
I wrote up an answer referencing this thread. The function is modified to report the distance and to avoid hard-coding. Note that it calculates Euclidean distance on the raw latitude/longitude values, so the result is in degrees rather than metres.
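library(data.table)

# Euclidean distance from the point (a, b) to every row of the lookup table.
# Note: the argument named df1 is the table being searched (df2 in the call
# below); x and y are the names of its coordinate columns.
mydist <- function(a, b, df1, x, y){
  dt <- data.table(sqrt((df1[[x]]-a)^2 + (df1[[y]]-b)^2))
  # Row index of the nearest point in the lookup table, and its distance
  return(data.table(Closest.V1 = which.min(dt$V1),
                    Distance = dt[which.min(dt$V1)]))
}

# df1 and df2 are assumed to hold dataset 1 and dataset 2 from the question
setDT(df1)[, j = mydist(HIGH_PRCN_LAT, HIGH_PRCN_LON, setDT(df2),
                        "HIGH_PRCN_LAT", "HIGH_PRCN_LON"),
           by = list(id, HIGH_PRCN_LAT, HIGH_PRCN_LON)]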
# id HIGH_PRCN_LAT HIGH_PRCN_LON Closest.V1 Distance.V1
# 1: 1 52.88144 -2.873778 5 0.7990743
# 2: 2 57.80945 -2.234544 8 2.1676868
# 3: 4 34.02335 -3.098445 10 1.4758202
# 4: 5 63.80879 -2.439163 3 4.2415854
# 5: 6 53.68881 -7.396112 2 3.6445416
# 6: 7 63.44628 -5.162345 3 2.3577811
# 7: 8 21.60755 -8.633113 9 8.2123762
# 8: 9 78.32444 3.813290 7 11.4936496
# 9: 10 66.85533 -3.994326 1 1.9296370
# 10: 3 51.62354 -8.906553 2 3.2180026
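Because the function above measures Euclidean distance on raw degrees, the reported values are not metres. If distances on the ground are wanted, one option is the haversine formula. The following is only a sketch in base R: the helper names haversine_m and nearest_station, the column names lat and lon, the toy data, and the spherical Earth radius of 6,371 km are all assumptions made for illustration. It also searches each distinct coordinate pair only once, which may help when many rows share coordinates.

```r
# Great-circle distance in metres between one point and a vector of points,
# using the haversine formula on a sphere (radius is an assumed mean value).
haversine_m <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: for each row of `pts`, find the nearest row of
# `stations` by brute force. Both need numeric columns `lat` and `lon`.
nearest_station <- function(pts, stations) {
  key   <- paste(pts$lat, pts$lon)          # one key per coordinate pair
  first <- !duplicated(key)                 # first occurrence of each pair
  uniq  <- pts[first, c("lat", "lon")]
  idx   <- integer(nrow(uniq))
  dist  <- numeric(nrow(uniq))
  for (i in seq_len(nrow(uniq))) {
    d <- haversine_m(uniq$lat[i], uniq$lon[i], stations$lat, stations$lon)
    idx[i]  <- which.min(d)
    dist[i] <- d[idx[i]]
  }
  pos <- match(key, key[first])             # map every row to its unique pair
  data.frame(nearest = idx[pos], distance_m = dist[pos])
}

# Toy data, made up for illustration only.
pts      <- data.frame(lat = c(0, 0, 10), lon = c(0, 0, 10))
stations <- data.frame(lat = c(0, 10),    lon = c(1, 11))
res <- nearest_station(pts, stations)
```

Here res$nearest holds row numbers of stations, so a station ID column could be looked up with those indices. The brute-force loop costs one pass over all stations per distinct point, which should be tolerable for a few thousand points on each side but scales poorly beyond that.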
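You can use RANN::nn2, but you need to make sure to use the right syntax: the first argument is the data to search (dataset 2) and the second is the query (dataset 1), each restricted to its coordinate columns. The following works:
as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))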
# nn.idx nn.dists
# 1 5 0.7990743
# 2 8 2.1676868
# 3 10 1.4758202
# 4 3 4.2415854
# 5 2 3.6445416
# 6 3 2.3577811
# 7 9 8.2123762
# 8 7 11.4936496
# 9 1 1.9296370
# 10 2 3.2180026
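The nn.idx column above contains row numbers of dataset 2, not station IDs, so one more step is needed to fill SRC_ID and distance in dataset 1. A minimal sketch, assuming the RANN package is installed and using the sample data from the question with coordinates rounded as printed above; the variable name coord_cols is introduced here for illustration:

```r
library(RANN)

dataset1 <- data.frame(
  id = c(1, 2, 4, 5, 6, 7, 8, 9, 10, 3),
  HIGH_PRCN_LAT = c(52.88144, 57.80945, 34.02335, 63.80879, 53.68881,
                    63.44628, 21.60755, 78.32444, 66.85533, 51.62354),
  HIGH_PRCN_LON = c(-2.873778, -2.234544, -3.098445, -2.439163, -7.396112,
                    -5.162345, -8.633113, 3.813290, -3.994326, -8.906553)
)
dataset2 <- data.frame(
  SRC_ID = c(55, 54, 23, 11, 44, 21, 76, 5688, 440, 61114),
  HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432, 42.57807, 52.29879,
                    68.52132, 87.83912, 55.67825, 29.74444, 34.33228),
  HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
                    -6.99441, -2.63457, -2.63057, -7.52216, -1.65532)
)

# Selecting columns by name means the two data frames do not need the same
# overall width; only the coordinate columns are passed to nn2.
coord_cols <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")

# data  = the table being searched (dataset 2)
# query = the points we want a neighbour for (dataset 1)
nn <- nn2(data = dataset2[, coord_cols],
          query = dataset1[, coord_cols],
          k = 1)

# Translate dataset 2 row numbers into station IDs and store the distances.
dataset1$SRC_ID   <- dataset2$SRC_ID[nn$nn.idx[, 1]]
dataset1$distance <- nn$nn.dists[, 1]   # Euclidean distance, in degrees
```

The eps argument of nn2 is an error bound for approximate search; its default of 0 requests an exact nearest neighbour, so it can be left alone unless speed matters more than exactness.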