(Spatial) Efficient way of finding all points within X meters of a point?

Problem description

I have a large spatial dataset (12M rows). The geometries are points on a map. For each row in the dataset, I'd like to find all the points that are within 500 meters of that point.

In R, using sf, I've been trying to do this by looping through each row in parallel and running st_buffer and st_intersects, then saving the result as a list in key-value format (the key being the origin point, the values being the neighbors).

The issue is that the dataset is too large. Even when parallelizing to upwards of 60 cores, the operation takes too long (>1 week and usually crashes).

What are the alternatives to this brute-force approach? Is it possible to build indexes using sf? Perhaps push the operation to an external database?

Reprex:

library(sf)
library(tidyverse)
library(parallel)
library(foreach)


# example data, projected to UTM (EPSG:32618) so distances are in meters:
nc <- st_read(system.file("shape/nc.shp", package="sf")) %>% st_transform(32618)
# expand the data a bit to make the example more interesting:
nc <- rbind(nc,nc,nc)
nc <- nc %>% mutate(Id = row_number())


## can run in parallel if desired:
# num_cores <- parallel::detectCores()-2
# cl <- makeSOCKcluster(num_cores)
# registerDoSNOW(cl)

# or just run in sequence:
registerDoSEQ()

neighbors <- foreach(ii = 1:nrow(nc)
                      , .verbose = FALSE
                      , .errorhandling = "pass") %dopar% {

                        l = 500 # 500 meters

                        # isolate the row as the origin point:
                        row_interest <- filter(nc, row_number()==ii)

                        # create the buffer:
                        buffer <- row_interest %>% st_buffer(dist = l)

                        # extract the row numbers of the neighbors
                        comps_idx <- suppressMessages(st_intersects(buffer, nc))[[1]]

                        # get all the neighbors:
                        comps <- nc %>% filter(row_number() %in% comps_idx)

                        # remove the geometry:
                        comps <- comps %>% st_set_geometry(NULL)

                        # flow control in case there are no neighbors:
                        if(nrow(comps)>0) {
                          comps$Origin_Key <- row_interest$Id
                        } else {
                          comps <- data_frame("lat" = NA_integer_,"lon" = NA_integer_, "bbl" = row_interest$bbl)
                          comps$Origin_Key <- row_interest$Id
                        }


                        return(comps)
                      }

closeAllConnections()

length(neighbors)==nrow(nc)
[1] TRUE

Solution

When working with sf objects, explicitly looping over features to perform binary operations such as intersects is usually counterproductive (see also How can I speed up spatial operations in `dplyr::mutate()`?)

An approach similar to yours (i.e., buffering and intersecting), but without the explicit for loop, works better.

Let's see how it performs on a reasonably big dataset of 50000 points:

library(sf)
library(spdep)

pts <- data.frame(x = runif(50000, 0, 100000),
                  y = runif(50000, 0, 100000))
pts     <- sf::st_as_sf(pts, coords = c("x", "y"), remove = F)
pts_buf <- sf::st_buffer(pts, 5000)
coords  <- sf::st_coordinates(pts)

microbenchmark::microbenchmark(
  sf_int = {int <- sf::st_intersects(pts_buf, pts)},
  spdep  = {x   <- spdep::dnearneigh(coords, 0, 5000)}
  , times = 1)
#> Unit: seconds
#>    expr       min        lq      mean    median        uq       max neval
#>  sf_int  21.56186  21.56186  21.56186  21.56186  21.56186  21.56186     1
#>   spdep 108.89683 108.89683 108.89683 108.89683 108.89683 108.89683     1

You can see here that the st_intersects approach is 5 times faster than the dnearneigh one.

Unfortunately, this is unlikely to solve your problem. Looking at execution times for datasets of different sizes we get:

subs <- c(1000, 3000, 5000, 10000, 15000, 30000, 50000)
times <- NULL
for (sub in subs[1:7]) {
  pts_sub <- pts[1:sub,]
  buf_sub <- pts_buf[1:sub,]
  t0 <- Sys.time()
  int <- sf::st_intersects(buf_sub, pts_sub)
  times <- cbind(times, as.numeric(difftime(Sys.time() , t0, units = "secs")))
}

plot(subs, times)

times <- as.numeric(times)
reg <- lm(times~subs+I(subs^2))
summary(reg)
#> 
#> Call:
#> lm(formula = times ~ subs + I(subs^2))
#> 
#> Residuals:
#>        1        2        3        4        5        6        7 
#> -0.16680 -0.02686  0.03808  0.21431  0.10824 -0.23193  0.06496 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2.429e-01  1.371e-01   1.772    0.151    
#> subs        -2.388e-05  1.717e-05  -1.391    0.237    
#> I(subs^2)    8.986e-09  3.317e-10  27.087  1.1e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1908 on 4 degrees of freedom
#> Multiple R-squared:  0.9996, Adjusted R-squared:  0.9994 
#> F-statistic:  5110 on 2 and 4 DF,  p-value: 1.531e-07

Here, we see an almost perfect quadratic relationship between time and number of points (as would be expected). On a 10M points subset, assuming that the behaviour does not change, you would get:

predict(reg, newdata = data.frame(subs = 10E6))
#>        1 
#> 898355.4

That corresponds to about 10 days, assuming the trend stays constant when further increasing the number of points (but the same would happen for dnearneigh...).
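Spelling out that conversion (plain arithmetic on the predicted value above):

predict(reg, newdata = data.frame(subs = 10E6)) / 86400  # seconds in a day
#>        1 
#> 10.39763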

My suggestion would be to "split" your points in chunks and then work on a per-split basis.

You could, for example, order your points along the x-axis at the start and then quickly extract the subsets of buffers and of points to compare them with, using data.table.

Clearly, the "points" buffer would need to be larger than that of "buffers" according to the comparison distance. So, for example, if you make a subset of pts_buf with centroids in [50000 - 55000], the corresponding subset of pts should include points in the range [49500 - 55500]. This approach is easily parallelizable by assigning the different subsets to different cores in a foreach or similar construct.

I do not even know if using spatial objects/operations is beneficial here, since once we have the coordinates, all that is needed is computing and subsetting Euclidean distances: I suspect that a carefully coded brute-force data.table-based approach could also be a feasible solution.
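For instance, a minimal sketch of that brute-force idea, reusing the hypothetical slice_buf / slice_pts tables and maxdist from the snippet above:

# for every point of the "buffers" slice, keep the ids of the candidate
# points within maxdist (squared Euclidean distance avoids the sqrt)
neigh_ids <- lapply(seq_len(nrow(slice_buf)), function(i) {
  p <- slice_buf[i]
  slice_pts[(x - p$x)^2 + (y - p$y)^2 <= maxdist^2, id]
})
names(neigh_ids) <- slice_buf$id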

HTH!

UPDATE

In the end, I decided to give it a go and see how much speed we could gain from this kind of approach. Here is a possible implementation:

points_in_distance_parallel <- function(in_pts,
                                        maxdist,
                                        ncuts = 10) {

  require(doParallel)
  require(foreach)
  require(data.table)
  require(sf)
  # convert points to data.table and create a unique identifier
  pts <-  data.table(in_pts)
  pts <- pts[, or_id := 1:dim(in_pts)[1]]

  # divide the extent in quadrants in ncuts*ncuts quadrants and assign each
  # point to a quadrant, then create the index over "xcut"
  range_x  <- range(pts$x)
  limits_x <-(range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts)
  range_y  <- range(pts$y)
  limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
  pts[, `:=`(xcut =  as.integer(cut(x, ncuts, labels = 1:ncuts)),
             ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))] %>%
    setkey(xcut, ycut)

  results <- list()

  cl <- parallel::makeCluster(parallel::detectCores() - 2, type =
                                ifelse(.Platform$OS.type != "windows", "FORK",
                                       "PSOCK"))
  doParallel::registerDoParallel(cl)
  # start cycling over quadrants
  out <- foreach(cutx = seq_len(ncuts), .packages = c("sf", "data.table")) %dopar% {

    count <- 0

    # get the points included in a x-slice extended by `dist`, and build
    # an index over y
    min_x_comp    <- ifelse(cutx == 1, limits_x[cutx], (limits_x[cutx] - maxdist))
    max_x_comp    <- ifelse(cutx == ncuts,
                            limits_x[cutx + 1],
                            (limits_x[cutx + 1] + maxdist))
    subpts_x <- pts[x >= min_x_comp & x < max_x_comp] %>%
      setkey(y)

    for (cuty in seq_len(ncuts)) {

      count <- count + 1

      # subset over subpts_x to find the final set of points needed for the
      # comparisons
      min_y_comp  <- ifelse(cuty == 1,
                            limits_y[cuty],
                            (limits_y[cuty] - maxdist))
      max_y_comp  <- ifelse(cuty == ncuts,
                            limits_y[cuty + 1],
                            (limits_y[cuty + 1] + maxdist))
      subpts_comp <- subpts_x[y >= min_y_comp & y < max_y_comp]

      # subset over subpts_comp to get the points included in a x/y chunk,
      # which "neighbours" we want to find. Then buffer them.
      subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
        sf::st_as_sf() %>%
        st_buffer(maxdist)

      # retransform to sf since data.tables lost the geometric attributes
      subpts_comp <- sf::st_as_sf(subpts_comp)

      # compute the intersection and save results in a element of "results".
      # For each point, save its "or_id" and the "or_ids" of the points within "dist"

      inters <- sf::st_intersects(subpts_buf, subpts_comp)

      # save results
      results[[count]] <- data.table(
        id = subpts_buf$or_id,
        int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))

    }
    return(data.table::rbindlist(results))
  }
  parallel::stopCluster(cl)
  data.table::rbindlist(out)
}

The function takes as input a point sf object, a target distance, and a number of "cuts" used to divide the extent into quadrants, and returns a data.table in which, for each original point, the "ids" of the points within maxdist are reported in the int_ids list column.
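A minimal usage sketch (assuming the 50000-point pts object built earlier, which keeps x and y as plain columns thanks to remove = F, since the function reads those column names; the result name res is hypothetical):

# neighbours within 500 m of each point of the test set
res <- points_in_distance_parallel(pts, maxdist = 500, ncuts = 10)
head(res)

# ids of the points falling within 500 m of the point with or_id == 1
res[id == 1, int_ids][[1]]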

On a test dataset with a varying number of uniformly distributed points, and two values of maxdist, I got the timings summarized below (the "parallel" run is done using 6 cores):

So, here we already get a 5-6X speed improvement over the "serial" implementation, and another 5X thanks to parallelization over 6 cores. Although the timings shown here are merely indicative and related to the particular test dataset we built (on a less uniformly distributed dataset I would expect a lower speed improvement), I think this is quite good.

HTH!

PS: a more thorough analysis can be found here:

https://lbusettspatialr.blogspot.it/2018/02/speeding-up-spatial-analyses-by.html
