如何使用doParallel计算R中邮政编码之间的距离? [英] How to use doParallel for calculating distance between zipcodes in R?

查看:28
本文介绍了如何使用doParallel计算R中邮政编码之间的距离?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含两个邮政编码和相应纬度和经度的大型数据集(260 万行),我正在尝试计算它们之间的距离.我主要使用包 geosphere 来计算邮政编码之间的 Vincenty Ellipsoid 距离,但是我的数据集花费了大量时间.有什么可以快速实现的方法?

I have a large dataset (2.6M rows) with two zip codes and the corresponding latitudes and longitudes, and I am trying to compute the distance between them. I am primarily using the package geosphere to calculate Vincenty Ellipsoid distance between the zip codes but it is taking a massive amount of time for my dataset. What can be a fast way to implement this?

我的尝试

library(tidyverse)
library(geosphere)

zipdata <- select(fulldata,originlat,originlong,destlat,destlong)

## Very basic approach
for(i in seq_len(nrow(zipdata))){
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i],zipdata$originlong[i]),
       c(zipdata$destlat[i],zipdata$destlong[i]),
       fun=distVincentyEllipsoid)
}

## Tidyverse approach 
zipdata <- zipdata%>%
 mutate(dist2 = distm(cbind(originlat,originlong), cbind(destlat,destlong), 
   fun = distHaversine))

这两种方法都非常慢.我知道 210 万行永远不会是快速"计算,但我认为它可以做得更快.我在较小的测试数据上尝试了以下方法,但没有任何运气,

Both of these methods are extremely slow. I understand that 2.1M rows will never be a "fast" calculation, but I think it can be made faster. I have tried the following approach on a smaller test data without any luck,

library(doParallel)
cores <- 15
cl <- makeCluster(cores)
registerDoParallel(cl)

test <- select(head(fulldata,n=1000),originlat,originlong,destlat,destlong)

foreach(i = seq_len(nrow(test))) %dopar% {
  library(geosphere)
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i],zipdata$originlong[i]),
       c(zipdata$destlat[i],zipdata$destlong[i]),
       fun=distVincentyEllipsoid) 
}
stopCluster(cl)

谁能帮助我以正确的方式使用 doParallelgeosphere 或者更好的方法来处理这个问题?

Can anyone help me out with either the correct way to use doParallel with geosphere or a better way to handle this?

(部分)回复的基准

## benchmark
library(microbenchmark)
zipsamp <- sample_n(zip,size=1000000)
microbenchmark(
  dave = {
    # Dave2e
    zipsamp$dist1 <- distHaversine(cbind(zipsamp$patlong,zipsamp$patlat),
                                   cbind(zipsamp$faclong,zipsamp$faclat))
  },
  geohav = {
    zipsamp$dist2 <- geodist(cbind(long=zipsamp$patlong,lat=zipsamp$patlat),
                             cbind(long=zipsamp$faclong,lat=zipsamp$faclat),
                             paired = T,measure = "haversine")
  },
  geovin = {
    zipsamp$dist3 <- geodist(cbind(long=zipsamp$patlong,lat=zipsamp$patlat),
                             cbind(long=zipsamp$faclong,lat=zipsamp$faclat),
                             paired = T,measure = "vincenty")
  },
  geocheap = {
    zipsamp$dist4 <- geodist(cbind(long=zipsamp$patlong,lat=zipsamp$patlat),
                             cbind(long=zipsamp$faclong,lat=zipsamp$faclat),
                             paired = T,measure = "cheap")
  }
,unit = "s",times = 100)

# Unit: seconds
# expr        min         lq       mean     median         uq        max neval  cld
# dave 0.28289613 0.32010753 0.36724810 0.32407858 0.32991396 2.52930556   100    d
# geohav 0.15820531 0.17053853 0.18271300 0.17307864 0.17531687 1.14478521   100  b  
# geovin 0.23401878 0.24261274 0.26612401 0.24572869 0.24800670 1.26936889   100   c 
# geocheap 0.01910599 0.03094614 0.03142404 0.03126502 0.03203542 0.03607961   100 a  

一个简单的 all.equal 测试表明,对于我的数据集,haversine 方法等于 vincenty 方法,但与 geodist 包.

A simple all.equal test showed that for my dataset the haversine method is equal to the vincenty method, but has a "Mean relative difference: 0.01002573" with the "cheap" method from the geodist package.

推荐答案

R 是一种向量化语言,因此该函数将对向量中的所有元素进行操作.由于您正在计算每一行的原始和目的地之间的距离,因此不需要循环.矢量化方法大约是循环性能的 1000 倍.
此外,直接使用 distVincentyEllipsoid(或 distHaveersine 等)并绕过 distm 函数也应该可以提高性能.

R is a vectorized language, thus the function will operate over all of the elements in the vectors. Since you are calculating the distance between the original and destination for each row, the loop is unnecessary. The vectorized approach is approximately 1000x the performance of the loop.
Also using the distVincentyEllipsoid (or distHaveersine, etc. )directly and bypassing the distm function should also improve the performance.

在没有任何示例数据的情况下,此代码段未经测试.

Without any sample data this snippet is untested.

library(geosphere)

zipdata <- select(fulldata,originlat,originlong,destlat,destlong)

## Very basic approach
zipdata$dist1 <- distVincentyEllipsoid(c(zipdata$originlong, zipdata$originlat), 
       c(zipdata$destlong, zipdata$destlat))

注意:为了使大多数地圈功能正常工作,正确的顺序是:先经度,然后是纬度.

上面列出的 tidyverse 方法缓慢的原因是 distm 函数正在计算每个起点和终点之间的距离,这将产生一个 200 万乘 200 万的元素矩阵.

The reason the tidyverse approach listed above is slow is the distm function is calculating the distance between every origin and destination which would result in a 2 million by 2 million element matrix.

这篇关于如何使用doParallel计算R中邮政编码之间的距离?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆