How to calculate distance between 2 coordinates below a certain threshold in R?


Problem description



I have 44,000 US zip codes and their corresponding centroid lat/long in R, from the 'zipcode' package. I need to calculate the distance between each pair of zip codes and keep those distances that are less than 5 miles. The problem is that calculating all pairwise distances would require a 44,000 x 44,000 matrix, which I can't create due to space issues.
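As a back-of-the-envelope check (my arithmetic, not part of the original question), a dense 44,000 x 44,000 matrix of double-precision distances would indeed need roughly 14 GiB:

```python
n = 44_000
bytes_needed = n * n * 8      # one 8-byte double per zip-code pair
gib = bytes_needed / 2**30
print(round(gib, 1))          # ≈ 14.4 GiB
```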

I searched through related posts; the closest to my requirement is one that returns the minimum distance between two datasets of lat/long points:

library(geosphere)  # distGeo()
library(dplyr)      # bind_rows()

DB1 <- data.frame(location_id = 1:7000,
                  LATITUDE  = runif(7000, min = -90,  max = 90),
                  LONGITUDE = runif(7000, min = -180, max = 180))
DB2 <- data.frame(location_id = 7001:12000,
                  LATITUDE  = runif(5000, min = -90,  max = 90),
                  LONGITUDE = runif(5000, min = -180, max = 180))

DistFun <- function(ID){
  TMP  <- DB1[DB1$location_id == ID, ]
  TMP1 <- distGeo(TMP[, 3:2], DB2[, 3:2])  # columns 3:2 give (longitude, latitude)
  TMP2 <- data.frame(DB1ID = ID,
                     DB2ID = DB2[which.min(TMP1), 1],
                     DistanceBetween = min(TMP1))
  print(ID)
  return(TMP2)
}

DistanceMatrix <- bind_rows(lapply(DB1$location_id, DistFun))  # rbind_all() is deprecated

Even if the above code is modified to keep all distances <= 5 miles (for example), it is extremely slow to execute.

Is there an efficient way to arrive at all zip code combinations that are <= 5 miles from each other's centroids?

Solution

Generating the whole distance matrix at once would be very RAM-consuming, while looping over each combination of unique zip codes would be very time-consuming. Let's find a compromise.

I suggest splitting the zipcode data.frame into chunks of (for example) 100 rows (with the help of the chunk function from the bit package), then calculating distances between all 44,336 points and the 100 points in the current chunk, filtering by the target distance threshold, and moving on to the next chunk. In my example I convert the zipcode data to a data.table to gain some speed and save RAM.
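The chunking strategy is language-agnostic. As a minimal sketch (mine, not the answer's code), here is the same idea in Python/NumPy, using the haversine formula as a stand-in for distGeo's geodesic distance; function and variable names are illustrative:

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between broadcastable coordinate arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3958.8 * 2 * np.arcsin(np.sqrt(a))  # mean Earth radius in miles

def nearby_pairs(lat, lon, threshold=5.0, chunk=100):
    """Yield (i, j) index pairs whose centroids lie within `threshold` miles."""
    n = len(lat)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # distances between this chunk and all points: shape (stop-start, n)
        d = haversine_miles(lat[start:stop, None], lon[start:stop, None],
                            lat[None, :], lon[None, :])
        for i, j in zip(*np.nonzero(d <= threshold)):
            yield start + i, j
```

Each chunk only materializes a (chunk x n) distance block, so peak memory stays bounded no matter how large n is.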

library(zipcode)
library(data.table)
library(magrittr)
library(geosphere)

data(zipcode)

setDT(zipcode)
zipcode[, dum := NA] # we'll need it for full outer join

Just for information: this is the approximate size of each piece of data in RAM.

merge(zipcode, zipcode[1:100], by = "dum", allow.cartesian = T) %>% 
  object.size() %>% print(unit = "Mb")
# 358.2 Mb

The code itself.

lapply(bit::chunk(1, nrow(zipcode), 1e2), function(ridx) {
  merge(zipcode, zipcode[ridx[1]:ridx[2]], by = "dum", allow.cartesian = T)[
    , dist := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2), 
                      matrix(c(longitude.y, latitude.y), ncol = 2))/1609.34 # meters to miles
    ][dist <= 5 # necessary distance threshold
      ][, dum := NULL]
  }) %>% rbindlist -> zip_nearby_dt

zip_nearby_dt # not the whole result; first 10 chunks only

       zip.x          city.x state.x latitude.x longitude.x zip.y     city.y state.y latitude.y longitude.y     dist
    1: 00210      Portsmouth      NH   43.00590   -71.01320 00210 Portsmouth      NH   43.00590   -71.01320 0.000000
    2: 00210      Portsmouth      NH   43.00590   -71.01320 00211 Portsmouth      NH   43.00590   -71.01320 0.000000
    3: 00210      Portsmouth      NH   43.00590   -71.01320 00212 Portsmouth      NH   43.00590   -71.01320 0.000000
    4: 00210      Portsmouth      NH   43.00590   -71.01320 00213 Portsmouth      NH   43.00590   -71.01320 0.000000
    5: 00210      Portsmouth      NH   43.00590   -71.01320 00214 Portsmouth      NH   43.00590   -71.01320 0.000000
---                                                                                                              
15252: 02906      Providence      RI   41.83635   -71.39427 02771    Seekonk      MA   41.84345   -71.32343 3.688747
15253: 02912      Providence      RI   41.82674   -71.39770 02771    Seekonk      MA   41.84345   -71.32343 4.003095
15254: 02914 East Providence      RI   41.81240   -71.36834 02771    Seekonk      MA   41.84345   -71.32343 3.156966
15255: 02916         Rumford      RI   41.84325   -71.35391 02769   Rehoboth      MA   41.83507   -71.26115 4.820599
15256: 02916         Rumford      RI   41.84325   -71.35391 02771    Seekonk      MA   41.84345   -71.32343 1.573050

On my machine it took 1.7 minutes to process 10 chunks, so the whole processing may take 70-80 minutes: not fast, but possibly acceptable. We can increase the chunk size to 200 or 300 rows depending on the available RAM, which would shorten the processing time by a factor of 2 or 3, respectively.

The drawback of this solution is that the resulting data.table contains "duplicated" rows: both the distance from point A to point B and from B to A are present. This may need some additional filtering.
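A simple way to drop those duplicates (a sketch under the assumption that the pairs are exported as (zip_a, zip_b, distance) tuples, rather than kept in the data.table) is to retain only rows with zip_a < zip_b, which also removes self-pairs:

```python
def deduplicate(pairs):
    """Keep one row per unordered zip pair; also drops self-pairs (a == a)."""
    return [(a, b, d) for (a, b, d) in pairs if a < b]

pairs = [("00210", "00211", 0.0), ("00211", "00210", 0.0),
         ("02906", "02771", 3.688747), ("02771", "02906", 3.688747),
         ("00210", "00210", 0.0)]
```

Because zip codes are fixed-width strings, lexicographic `<` is a consistent ordering, so exactly one direction of each A/B pair survives.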
