按组的地理距离-在每对行上应用函数 [英] Geographical distance by group - Applying a function on each pair of rows

查看:84
本文介绍了按组的地理距离-在每对行上应用函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想计算每个省的房屋数量之间的平均地理距离。

I want to calculate the average geographical distance between a number of houses per province.

假设我有以下数据。

df1 <- data.frame(province = c(1, 1, 1, 2, 2, 2),
              house = c(1, 2, 3, 4, 5, 6),
              lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7), 
              lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))

使用地球圈

Using the geosphere library I can find the distance between two houses. For instance:

library(geosphere)
distm(c(df1$lon[1], df1$lat[1]), c(df1$lon[2], df1$lat[2]), fun = distHaversine)

#11429.1

如何计算该省所有房屋之间的距离并收集每个省的平均距离?

How do I calculate the distance between all the houses in the province and gather the mean distance per province?

原始数据集每个省都有数百万个观测值,因此此处的性能也是一个问题。

The original data-set has millions of observations per province, so performance is an issue here, too.

推荐答案

我的最初想法是查看 distHaversine 的源代码并将其复制到与 proxy
这样工作(请注意, lon 应该是第一列):

My initial idea was to look at the source code of distHaversine and replicate it in a function that I would use with proxy. That would work like this (note that lon is expected to be the first column):

library(geosphere)
library(dplyr)
library(proxy)

df1 <- data.frame(province = as.integer(c(1, 1, 1, 2, 2, 2)),
                  house = as.integer(c(1, 2, 3, 4, 5, 6)),
                  lat = c(-76.6, -76.5, -76.4, -75.4, -80.9, -85.7), 
                  lon = c(39.2, 39.1, 39.3, 60.8, 53.3, 40.2))

custom_haversine <- function(x, y) {
  toRad <- pi / 180

  diff <- (y - x) * toRad
  dLon <- diff[1L]
  dLat <- diff[2L]

  a <- sin(dLat / 2) ^ 2 + cos(x[2L] * toRad) * cos(y[2L] * toRad) * sin(dLon / 2) ^ 2
  a <- min(a, 1)
  # return
  2 * atan2(sqrt(a), sqrt(1 - a)) * 6378137
}

pr_DB$set_entry(FUN=custom_haversine, names="haversine", loop=TRUE, distance=TRUE)

average_dist <- df1 %>%
  select(-house) %>%
  group_by(province) %>%
  group_map(~ data.frame(avg=mean(proxy::dist(.x[ , c("lon", "lat")], method="haversine"))))

但是,如果您希望每个省数百万行, b $ b proxy 可能无法分配中间(矩阵的下三角)矩阵。
所以我将代码移植到C ++并添加了多线程作为奖励:

However, if you're expecting millions of rows per province, proxy probably won't be able to allocate the intermediate (lower triangular of the) matrices. So I ported the code to C++ and added multi-threading as a bonus:

EDIT :原来 s2d 辅助工具远非最佳,
此版本现在使用此处

EDIT: turns out the s2d helper was far from optimal, this version now uses the formulas given here.

EDIT2 :我刚刚发现了 RcppThread
,可用于检测用户中断。

EDIT2: I just found out about RcppThread, and it can be used to detect user interrupt.

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel,RcppThread)]]

#include <cstddef> // size_t
#include <math.h> // sin, cos, sqrt, atan2, pow
#include <vector>

#include <RcppThread.h>
#include <Rcpp.h>
#include <RcppParallel.h>

using namespace std;
using namespace Rcpp;
using namespace RcppParallel;

// single to double indices for lower triangular of matrices without diagonal
void s2d(const size_t id, const size_t nrow, size_t& i, size_t& j) {
  j = nrow - 2 - static_cast<size_t>(sqrt(-8 * id + 4 * nrow * (nrow - 1) - 7) / 2 - 0.5);
  i = id + j + 1 - nrow * (nrow - 1) / 2 + (nrow - j) * ((nrow - j) - 1) / 2;
}

class HaversineCalculator : public Worker
{
public:
  HaversineCalculator(const NumericVector& lon,
                      const NumericVector& lat,
                      double& avg,
                      const int n)
    : lon_(lon)
    , lat_(lat)
    , avg_(avg)
    , n_(n)
    , cos_lat_(lon.length())
  {
    // terms for distance calculation
    for (size_t i = 0; i < cos_lat_.size(); i++) {
      cos_lat_[i] = cos(lat_[i] * 3.1415926535897 / 180);
    }
  }

  void operator()(size_t begin, size_t end) {
    // for Kahan summation
    double sum = 0;
    double c = 0;

    double to_rad = 3.1415926535897 / 180;

    size_t i, j;
    for (size_t ind = begin; ind < end; ind++) {
      if (RcppThread::isInterrupted(ind % static_cast<int>(1e5) == 0)) return;

      s2d(ind, lon_.length(), i, j);

      // haversine distance
      double d_lon = (lon_[j] - lon_[i]) * to_rad;
      double d_lat = (lat_[j] - lat_[i]) * to_rad;
      double d_hav = pow(sin(d_lat / 2), 2) + cos_lat_[i] * cos_lat_[j] * pow(sin(d_lon / 2), 2);
      if (d_hav > 1) d_hav = 1;
      d_hav = 2 * atan2(sqrt(d_hav), sqrt(1 - d_hav)) * 6378137;

      // the average part
      d_hav /= n_;

      // Kahan sum step
      double y = d_hav - c;
      double t = sum + y;
      c = (t - sum) - y;
      sum = t;
    }

    mutex_.lock();
    avg_ += sum;
    mutex_.unlock();
  }

private:
  const RVector<double> lon_;
  const RVector<double> lat_;
  double& avg_;
  const int n_;
  tthread::mutex mutex_;
  vector<double> cos_lat_;
};

// [[Rcpp::export]]
double avg_haversine(const DataFrame& input, const int nthreads) {
  NumericVector lon = input["lon"];
  NumericVector lat = input["lat"];

  double avg = 0;
  int size = lon.length() * (lon.length() - 1) / 2;
  HaversineCalculator hc(lon, lat, avg, size);

  int grain = size / nthreads / 10;
  RcppParallel::parallelFor(0, size, hc, grain);
  RcppThread::checkUserInterrupt();

  return avg;
}

此代码不会分配任何中间矩阵,
会分配只需计算下三角形的每一对的距离,最后累积平均值的值即可。
关于Kahan求和部分,请参见此处

This code won't allocate any intermediate matrix, it will simply calculate the distance for each pair of what would be the lower triangular and accumulate the values for an average in the end. See here for the Kahan summation part.

如果将代码保存在 haversine.cpp
中,则可以执行以下操作:

If you save that code in, say, haversine.cpp, then you can do the following:

library(dplyr)
library(Rcpp)
library(RcppParallel)
library(RcppThread)

sourceCpp("haversine.cpp")

df1 %>%
  group_by(province) %>%
  group_map(~ data.frame(avg=avg_haversine(.x, parallel::detectCores())))
# A tibble: 2 x 2
# Groups:   province [2]
  province     avg
     <int>   <dbl>
1        1  15379.
2        2 793612.

这里也是一项健全性检查:

Here's a sanity check too:

pr_DB$set_entry(FUN=geosphere::distHaversine, names="distHaversine", loop=TRUE, distance=TRUE)

df1 %>%
  select(-house) %>%
  group_by(province) %>%
  group_map(~ data.frame(avg=mean(proxy::dist(.x[ , c("lon", "lat")], method="distHaversine"))))

不过请注意:

df <- data.frame(lon=runif(1e3, -90, 90), lat=runif(1e3, -90, 90))

system.time(proxy::dist(df, method="distHaversine"))
   user  system elapsed 
 34.353   0.005  34.394

system.time(proxy::dist(df, method="haversine"))
   user  system elapsed 
  0.789   0.020   0.809

system.time(avg_haversine(df, 4L))
   user  system elapsed 
  0.054   0.000   0.014

df <- data.frame(lon=runif(1e5, -90, 90), lat=runif(1e5, -90, 90))

system.time(avg_haversine(df, 4L))
   user  system elapsed 
 73.861   0.238  19.670

如果您有数百万行,您可能必须要等一段时间...

You'll probably have to wait quite a while if you have millions of rows...

我还应该提到,无法通过 RcppParallel
创建的线程中检测到用户中断,因此,如果您开始计算,则应该等待直到完成,
或完全重新启动R / RStudio。

请参见上面的EDIT2。

I should also mention that it's not possible to detect user interrupt inside the threads created through RcppParallel, so if you start the calculation you should either wait until it finishes, or restart R/RStudio entirely. See EDIT2 above.

根据您的实际数据以及计算机具有多少个内核,
可能需要等待几天才能完成计算。
这个问题的平方复杂度是
(可以说每个省)。
这行:

Depending on your actual data and how many cores your computer has, you may very well end up waiting days for the calculation to finish. This problem has quadratic complexity (per province, so to speak). This line:

int size = lon.length() * (lon.length() - 1) / 2;

表示必须执行的(haversine)距离计算量。
因此,如果行数增加了 n ,则
计算的数量增加了 n粗略地说,是^ 2/2

signifies the amount of (haversine) distance calculations that must be performed. So if the number of rows increases by a factor of n, the number of calculations increases by a factor of n^2 / 2, roughly speaking.

没有办法对此进行优化。
,如果不先计算每个数字,就无法计算 N 个数字的平均值,
很难找到比多线程C ++代码
,因此您要么不得不等待它,要么
要么抛出更多内核,要么单台机器或多台机器一起工作就需要

否则您将无法解决此问题。

There is no way to optimize this; you can't calculate the average of N numbers without actually computing each number first, and you'll have a hard time finding something faster than multi-threaded C++ code, so you'll either have to wait it out, or throw more cores at the problem, either with a single machine or with many machines working together. Otherwise you can't solve this problem.

这篇关于按组的地理距离-在每对行上应用函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆