如何计算大地理距离矩阵 [英] How to compute big geographic distance matrix

查看:445
本文介绍了如何计算大地理距离矩阵的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个ID和坐标数据框.我需要计算所有ID之间的地理距离,将彼此相距太远的ID删除,然后继续进行分析.

I have a dataframe of ids and coordinates. I need to calculate the geographic distance between all my ids, drop the ones that are too far from each other, and then go on with my analysis.

我有30k id,它将生成30k x 30k矩阵.这是一个示例:

I have 30k ids, which would generate a 30k x 30k matrix. Here is a sample:

 latitude longitude        id
-23.52472 -46.47785 917_62346
-23.62010 -46.69345 244_42975
-23.61636 -46.48148 302_75289
-23.53826 -46.46756 917_96304
-23.58266 -46.54495 302_84126
-23.47005 -46.70921 233_97098
-23.49235 -46.49342 917_62953
-23.52226 -46.72710 244_42245
-23.64853 -46.72237 635_90928
-23.49640 -46.61215  244_2662

x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624, 
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581, 
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512, 
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192, 
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975", 
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953", 
"244_42245", "635_90928", "244_2662")), .Names = c("latitude", 
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L, 
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")

首先,我尝试使用geosphere软件包直接进行搜索:

First I tried to go straight for it, using the geosphere package:

library(geosphere)
library(data.table)
d.matrix <- distm(cbind(x2$longitude, x2$latitude))

由于内存问题Error: cannot allocate vector of size 15.4 Gb,这不起作用.我的第二个尝试是首先预先生成所有成对组合,然后与原始数据集合并以获得经纬度,然后计算距离,例如

This does not work, because of memory issues, Error: cannot allocate vector of size 15.4 Gb. My second attempt was to first generate all the pairwise combinations beforehand, than merge with the original data set to get the lats and lons, and then calculate the distances, such as

dis.long <- expand.grid(x2$id, x2$id)
dis.long <- merge(dis.long, x2, by.x = "Var1", by.y = "id")
dis.long <- merge(dis.long, x2, by.x = "Var2", by.y = "id")
dis.long <- dis.long[ , dist_km2 := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2), 
                                        matrix(c(longitude.y, latitude.y), ncol = 2))/1000]

但是,expand_grid的内存不足.这对我来说很奇怪,因为结果矩阵将是900mi行乘2列,并且我已经处理了更大的数据集(例如200 mi x 50矩阵).

However, expand_grid runs out of memory. This is strange to me, since the resulting matrix would be 900mi rows by 2 cols, and I already deal with data sets way larger (like 200 mi x 50 matrices).

另一个观察结果,我已经尝试使用像new_id = seq(1L,30000L,1L)这样的id来查看整数是否可以解决它,但是当我尝试扩展时遇到相同的内存问题.

Another observation, I already tried using ids such as new_id = seq(1L,30000L,1L), to see whether integers would solve it, but I get the same memory issue when I try to expand.

除了16gb的Ram台式机外,我目前处于这些配置下

I am currently under these configurations, besides a 16gb Ram desktop

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xlsx_0.5.7        xlsxjars_0.6.1    rJava_0.9-8       geosphere_1.5-5   sp_1.2-5          haven_1.0.0      
[7] stringr_1.2.0     data.table_1.10.4

有人可以给我一个如何计算这些距离的想法吗?为什么我不能生成此特定的expand.grid却能够构造更大的对象?

Can anybody give me an idea of how to calculate these distances? And why I cant generate this specific expand.grid while being able to construct bigger objects?

推荐答案

您不需要比较all-vs-all,它包括自我比较和方向比较(A-B!= B-A);因此,您应该尝试使用combn而不是expand.grid

You do not need to compare all-vs-all, which includes self-comparison and directional comparison (A-B != B-A); therefore you should try combn instead of expand.grid

x2 = structure(list(latitude = c(-23.5247247, -23.6200954, -23.6163624, 
-23.5382557, -23.5826609, -23.4700519, -23.4923465, -23.5222581, 
-23.6485288, -23.4964047), longitude = c(-46.4778499, -46.6934512, 
-46.4814794, -46.4675563, -46.5449536, -46.7092093, -46.4934192, 
-46.7270957, -46.7223717, -46.6121477), id = c("917_62346", "244_42975", 
"302_75289", "917_96304", "302_84126", "233_97098", "917_62953", 
"244_42245", "635_90928", "244_2662")), .Names = c("latitude", 
"longitude", "id"), row.names = c(12041L, 18549L, 13641L, 28386L, 
9380L, 6064L, 12724L, 21671L, 18939L, 3396L), class = "data.frame")

expand.grid

OP <- function(df) {
            x3 = expand.grid(df$id, df$id)
            Y <- merge(x3, df, by.x = "Var1", by.y = "id")
            Y <- merge(Y, df, by.x = "Var2", by.y = "id")
            return(Y)
      }

vs combn

CP <- function(df) {
            Did = as.data.frame(t(combn(df$id, 2)))
            Z <- merge(Did, df, by.x = "V1", by.y = "id")
            Z <- merge(Z, df, by.x = "V2", by.y = "id")
            return(Z)
      }

比较

dis.long <- OP(x2)
object.size(dis.long)
# 7320 bytes

new <- CP(x2)
object.size(new)
# 5016 bytes

更大的例子

num <- 5e2
bigx <- data.frame(latitude=rnorm(num)*-23, longitude=rnorm(num)*-46, id=1:num)

bigdl <- OP(bigx)
object.size(bigdl)
# 10001224 bytes

bignew <- CP(bigx)
object.size(bignew)
# 4991224 bytes

大约一半大小

这篇关于如何计算大地理距离矩阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆