Fuzzy matching of coordinates

Problem description

I have two datasets, one of them containing the coordinates of people's addresses (addresses), and the other one containing the coordinates of rainfall in certain locations (rain). The coordinates are standard lat and lon. I would like to merge these two sets together, by matching each address to the nearest rainfall location, using the spherical distance between two coordinates to determine the "nearest". The naive way is to compute all pairwise distances between each address and each rainfall location and keep the minimum, but since my datasets are quite big, I was wondering if there was another computationally efficient way to do this.

I'm using the geosphere package to calculate distance.

Here is a subset of the data.

rain <- structure(list(lat = c(-179.75, -179.75, -179.75, -179.75, -179.75, 
-179.75, -179.75, -179.75, -179.75, -179.75), lon = c(71.25, 
68.75, 68.25, 67.75, 67.25, 66.75, 66.25, 65.75, 65.25, -16.75
), rainfall = c(0, 4.9, 4.6, 4.9, 8.9, 15.2, 24.2, 16.3, 12.2, 
365.4)), .Names = c("lat", "lon", "rainfall"), class = "data.frame", row.names = c(NA, 
-10L))


addresses <- structure(list(address_lat = c(-175.33, -175.20, -177.65, -174.10, -175.80, 
-179.50, -179.23, -179.12, -178.75, -174.77), address_lon = c(70.25, 
69.75, 62.23, 60.50, 66.25, 61.75, 62.54, 63.70, 61.45, -15.80),
person_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), .Names = c("address_lat", "address_lon",     
"person_id"), class = "data.frame", row.names = c(NA, -10L))

I've got 300,000 unique coordinate pairs in one set, and over 80,000 in the other. The only idea I have is to use two for loops, one to run over a list of address coordinate pairs, then another nested one to calculate the distance from each address to all rainfall locations, then keeping the smallest.

Recommended answer

First, I should mention that I think the column labels for latitude and longitude are reversed; as given, you end up with latitudes that are less than -90. :-) I have swapped them for my solution below.
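A minimal sketch of that relabeling, assuming the data frames are built as in the question (only two rows of each are reproduced here). The values stay where they are; only the column names change.

```r
# Two rows from the question's sample data, with the original (reversed) labels.
rain <- data.frame(lat = c(-179.75, -179.75),
                   lon = c(71.25, -16.75),
                   rainfall = c(0, 365.4))
addresses <- data.frame(address_lat = c(-175.33, -174.77),
                        address_lon = c(70.25, -15.80),
                        person_id = c(1, 10))

# Swap the labels: the first column actually holds longitude, the second latitude.
names(rain)[1:2] <- c("lon", "lat")
names(addresses)[1:2] <- c("address_lon", "address_lat")

# Sanity check: latitudes must lie within [-90, 90].
stopifnot(all(abs(rain$lat) <= 90), all(abs(addresses$address_lat) <= 90))
```

After the swap, the first two columns of each data frame are in (longitude, latitude) order, which is the order geosphere's distance functions expect.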

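library(geosphere)

# distm() reads columns as (longitude, latitude), which is what columns 1:2 hold.
# D[i, j] is the distance in metres from address i to rain location j.
D <- distm(addresses[, 1:2], rain[, 1:2])

# For each address (row of D), find the closest rain location and attach its row.
cbind(addresses, rain[apply(D, 1, which.min), ])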

First you form the distance matrix. Each row in this matrix gives the distances from one of the addresses to each of the rainfall observations. We use which.min to pick out the smallest entry in each row and then use this to index into the rainfall data.
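For the full data in the question, the distance matrix would have 300,000 × 80,000 = 2.4 × 10^10 entries, roughly 190 GB as doubles, so it is unlikely to fit in memory. One possible workaround, sketched here, is to process the addresses in blocks and keep only the index of the nearest rain location for each block. The helper names `haversine_m` and `nearest_rain`, the `block_size` parameter, and the 6,371 km mean Earth radius are assumptions for illustration. A hand-written haversine formula stands in for `distm` so that the sketch runs in base R, and its distances will differ slightly from geosphere's default ellipsoidal distance.

```r
# Great-circle (haversine) distance in metres; inputs are in degrees.
# r is an assumed mean Earth radius.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# For each address, return the row index of the nearest rain location.
# Only a block_size x length(rain_lon) matrix is held in memory at a time.
nearest_rain <- function(addr_lon, addr_lat, rain_lon, rain_lat,
                         block_size = 1000) {
  n <- length(addr_lon)
  if (n == 0) return(integer(0))
  nearest <- integer(n)
  for (start in seq(1, n, by = block_size)) {
    idx <- start:min(start + block_size - 1, n)
    # Rows are the addresses in this block, columns are the rain locations.
    d <- outer(idx, seq_along(rain_lon), function(i, j) {
      haversine_m(addr_lon[i], addr_lat[i], rain_lon[j], rain_lat[j])
    })
    nearest[idx] <- apply(d, 1, which.min)
  }
  nearest
}

# A few rows from the question's sample, with longitude and latitude
# labelled correctly.
rain <- data.frame(lon = c(-179.75, -179.75, -179.75),
                   lat = c(71.25, 66.25, -16.75),
                   rainfall = c(0, 24.2, 365.4))
addresses <- data.frame(person_id = c(1, 5, 10),
                        address_lon = c(-175.33, -175.80, -174.77),
                        address_lat = c(70.25, 66.25, -15.80))

idx <- nearest_rain(addresses$address_lon, addresses$address_lat,
                    rain$lon, rain$lat, block_size = 2)
matched <- cbind(addresses, rain[idx, ])
```

This is still a brute-force search, so the number of distance evaluations is unchanged; the block size only trades memory for the number of loop iterations.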
