How can I speed up spatial operations in `dplyr::mutate()`?


Problem description


I am working on a spatial problem using the sf package in conjunction with dplyr and purrr.

I would prefer to perform spatial operations inside a mutate call, like so:

simple_feature %>%
  mutate(geometry_area = map_dbl(geometry, ~ as.double(st_area(.x))))

I like that this approach allows me to run a series of spatial operations using %>% and mutate.

I dislike that this approach seems to significantly increase the run-time of the sf functions (sometimes prohibitively) and I would appreciate hearing suggestions about how to overcome this speed loss.

Here is a reprex that illustrates the speed loss problem in detail.


Please note: this is not a minimal example, and it requires installing a few packages and downloading one file from an ESRI REST API. I hope you'll go easy on me ;)

The objective in this example is to add a new column indicating whether each North Carolina county (nc) intersects with any of the waterbodies polygons (nc_wtr), as shown in the image below:

I created a function that performs this calculation: st_intersects_any()

Then I benchmark that function on two datasets (nc and nc_1e4), first using st_intersects_any() by itself and then using it inside a mutate call.

## |TEST               |  ELAPSED|
## |:------------------|--------:|
## |bm_sf_small        |     0.01|
## |bm_sf_dplyr_small  |     1.22|
## |bm_sf_large        |     0.95|
## |bm_sf_dplyr_large  |   122.88|

The benchmarks clearly show that the dplyr approach is substantially slower, and I'm hoping that someone has a suggestion for reducing or eliminating this speed loss while still using the dplyr approach.

If there are significantly faster ways to do this using data.table or some other method that I should check out, please let me know about those as well.

Thanks!

Reprex

# Setup ----

library(lwgeom) # devtools::install_github('r-spatial/lwgeom')
library(tidyverse) 
library(sf) 
library(esri2sf) # devtools::install_github('yonghah/esri2sf')
library(rbenchmark) 
library(knitr)

# Create the new sf function: st_intersects_any ----

st_intersects_any <- function(x, y) {
  st_intersects(x, y) %>%
    map_lgl(~ length(.x) > 0)
}

# Load data ----
# NC counties

nc <- read_sf(system.file("shape/nc.shp", package = "sf")) %>%
  st_transform(32119)

nc_1e4 <- list(nc) %>%
  rep(times = 1e2) %>%
  reduce(rbind)

# NC watersheds

url <- "https://services.nconemap.gov/secure/rest/services/NC1Map_Watersheds/MapServer/2"

nc_wtr <- esri2sf(url)
## Warning: package 'httr' was built under R version 3.4.2
## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten
## [1] "Feature Layer"
## [1] "esriGeometryPolygon"

nc_wtr <- st_transform(nc_wtr, 32119) %>%
  st_simplify(dTolerance = 100) # simplify the waterbodies geometries

# plot the data

par(mar = rep(.1, 4))
plot(st_geometry(nc), lwd = 1)
plot(st_geometry(nc_wtr), col = alpha("blue", .3), lwd = 1.5, add = TRUE)

# Benchmark the two approaches

cols <- c("elapsed", "relative")

bm_sf_small <- benchmark({
  st_intersects_any(nc, nc_wtr)
}, columns = cols, replications = 1)

bm_sf_dplyr_small <- benchmark({
  nc %>% transmute(INT = map_lgl(geometry, st_intersects_any, y = nc_wtr))
}, columns = cols, replications = 1)
## Warning: package 'bindrcpp' was built under R version 3.4.2

bm_sf_large <- benchmark({
  st_intersects_any(nc_1e4, nc_wtr)
}, columns = cols, replications = 1)

bm_sf_dplyr_large <- benchmark({
  nc_1e4 %>% transmute(INT = map_lgl(geometry, st_intersects_any, y = nc_wtr))
}, columns = cols, replications = 1)

tests <- list(bm_sf_small, bm_sf_dplyr_small, bm_sf_large, bm_sf_dplyr_large)

tbl <- tibble(
  TEST = c("bm_sf_small", "bm_sf_dplyr_small", "bm_sf_large", "bm_sf_dplyr_large"),
  ELAPSED = map_dbl(tests, "elapsed")
)

kable(tbl, format = "markdown", padding = 2)

## |TEST               |  ELAPSED|
## |:------------------|--------:|
## |bm_sf_small        |     0.01|
## |bm_sf_dplyr_small  |     1.22|
## |bm_sf_large        |     0.95|
## |bm_sf_dplyr_large  |   122.88|





devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.0 (2017-04-21)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  tz       America/Los_Angeles         
##  date     2018-01-31
## Packages -----------------------------------------------------------------
##  package    * version     date       source                            
##  assertthat   0.2.0       2017-04-11 CRAN (R 3.4.2)                    
##  backports    1.1.0       2017-05-22 CRAN (R 3.4.0)                    
##  base       * 3.4.0       2017-04-21 local                             
##  bindr        0.1         2016-11-13 CRAN (R 3.4.2)                    
##  bindrcpp   * 0.2         2017-06-17 CRAN (R 3.4.2)                    
##  broom        0.4.3       2017-11-20 CRAN (R 3.4.3)                    
##  cellranger   1.1.0       2016-07-27 CRAN (R 3.4.2)                    
##  class        7.3-14      2015-08-30 CRAN (R 3.4.0)                    
##  classInt     0.1-24      2017-04-16 CRAN (R 3.4.2)                    
##  cli          1.0.0       2017-11-05 CRAN (R 3.4.2)                    
##  colorspace   1.3-2       2016-12-14 CRAN (R 3.4.2)                    
##  compiler     3.4.0       2017-04-21 local                             
##  crayon       1.3.4       2017-10-30 Github (r-lib/crayon@b5221ab)     
##  curl         3.0         2017-10-06 CRAN (R 3.4.2)                    
##  datasets   * 3.4.0       2017-04-21 local                             
##  DBI          0.7         2017-06-18 CRAN (R 3.4.2)                    
##  devtools     1.13.2      2017-06-02 CRAN (R 3.4.0)                    
##  digest       0.6.13      2017-12-14 CRAN (R 3.4.3)                    
##  dplyr      * 0.7.4       2017-09-28 CRAN (R 3.4.2)                    
##  e1071        1.6-8       2017-02-02 CRAN (R 3.4.2)                    
##  esri2sf    * 0.1.0       2017-12-12 Github (yonghah/esri2sf@81d211f)  
##  evaluate     0.10.1      2017-06-24 CRAN (R 3.4.3)                    
##  forcats    * 0.2.0       2017-01-23 CRAN (R 3.4.3)                    
##  foreign      0.8-67      2016-09-13 CRAN (R 3.4.0)                    
##  ggplot2    * 2.2.1.9000  2017-12-02 Github (tidyverse/ggplot2@7b5c185)
##  glue         1.2.0.9000  2018-01-13 Github (tidyverse/glue@1592ee1)   
##  graphics   * 3.4.0       2017-04-21 local                             
##  grDevices  * 3.4.0       2017-04-21 local                             
##  grid         3.4.0       2017-04-21 local                             
##  gtable       0.2.0       2016-02-26 CRAN (R 3.4.2)                    
##  haven        1.1.0       2017-07-09 CRAN (R 3.4.2)                    
##  hms          0.4.0       2017-11-23 CRAN (R 3.4.3)                    
##  htmltools    0.3.6       2017-04-28 CRAN (R 3.4.0)                    
##  httr       * 1.3.1       2017-08-20 CRAN (R 3.4.2)                    
##  jsonlite   * 1.5         2017-06-01 CRAN (R 3.4.0)                    
##  knitr        1.18        2017-12-27 CRAN (R 3.4.3)                    
##  lattice      0.20-35     2017-03-25 CRAN (R 3.4.0)                    
##  lazyeval     0.2.1       2017-10-29 CRAN (R 3.4.2)                    
##  lubridate    1.7.1       2017-11-03 CRAN (R 3.4.2)                    
##  lwgeom     * 0.1-1       2017-12-16 Github (r-spatial/lwgeom@baf22c6) 
##  magrittr     1.5         2014-11-22 CRAN (R 3.4.0)                    
##  memoise      1.1.0       2017-04-21 CRAN (R 3.4.0)                    
##  methods    * 3.4.0       2017-04-21 local                             
##  mnormt       1.5-5       2016-10-15 CRAN (R 3.4.1)                    
##  modelr       0.1.1       2017-07-24 CRAN (R 3.4.2)                    
##  munsell      0.4.3       2016-02-13 CRAN (R 3.4.2)                    
##  nlme         3.1-131     2017-02-06 CRAN (R 3.4.0)                    
##  parallel     3.4.0       2017-04-21 local                             
##  pillar       1.0.99.9001 2018-01-16 Github (r-lib/pillar@9d96835)     
##  pkgconfig    2.0.1       2017-03-21 CRAN (R 3.4.2)                    
##  plyr         1.8.4       2016-06-08 CRAN (R 3.4.2)                    
##  psych        1.7.8       2017-09-09 CRAN (R 3.4.2)                    
##  purrr      * 0.2.4.9000  2017-12-05 Github (tidyverse/purrr@62b135a)  
##  R6           2.2.2       2017-06-17 CRAN (R 3.4.0)                    
##  rbenchmark * 1.0.0       2012-08-30 CRAN (R 3.4.1)                    
##  Rcpp         0.12.15     2018-01-20 CRAN (R 3.4.3)                    
##  readr      * 1.1.1       2017-05-16 CRAN (R 3.4.2)                    
##  readxl       1.0.0       2017-04-18 CRAN (R 3.4.2)                    
##  reshape2     1.4.2       2016-10-22 CRAN (R 3.4.2)                    
##  rlang        0.1.6       2017-12-21 CRAN (R 3.4.3)                    
##  rmarkdown    1.8         2017-11-17 CRAN (R 3.4.2)                    
##  rprojroot    1.3-2       2018-01-03 CRAN (R 3.4.3)                    
##  rvest        0.3.2       2016-06-17 CRAN (R 3.4.2)                    
##  scales       0.5.0.9000  2017-12-02 Github (hadley/scales@d767915)    
##  sf         * 0.6-1       2018-01-24 Github (r-spatial/sf@7ea67a5)     
##  stats      * 3.4.0       2017-04-21 local                             
##  stringi      1.1.6       2017-11-17 CRAN (R 3.4.2)                    
##  stringr    * 1.2.0       2017-02-18 CRAN (R 3.4.0)                    
##  tibble     * 1.4.1.9000  2018-01-18 Github (tidyverse/tibble@64fedbd) 
##  tidyr      * 0.7.2.9000  2018-01-13 Github (tidyverse/tidyr@74bd48f)  
##  tidyverse  * 1.2.1       2017-11-14 CRAN (R 3.4.3)                    
##  tools        3.4.0       2017-04-21 local                             
##  udunits2     0.13        2016-11-17 CRAN (R 3.4.1)                    
##  units        0.5-1       2018-01-08 CRAN (R 3.4.3)                    
##  utf8         1.1.3       2018-01-03 CRAN (R 3.4.3)                    
##  utils      * 3.4.0       2017-04-21 local                             
##  withr        2.1.1.9000  2018-01-13 Github (jimhester/withr@df18523)  
##  xml2         1.1.1       2017-01-24 CRAN (R 3.4.2)                    
##  yaml         2.1.14      2016-11-12 CRAN (R 3.4.0)

Solution

You can speed this up considerably by simply dropping the unnecessary map_lgl call in the pipe:

bm_sf_dplyr_large_fast <- benchmark({
  int_new <- nc_1e4 %>% mutate(INT = st_intersects_any(., nc_wtr))
}, columns = cols, replications = 1)
bm_sf_dplyr_large_fast

# bm_sf_dplyr_large_fast
# elapsed relative
# 1   0.829        1

The huge slowdown comes from the fact that mapping over the geometry rows is detrimental in this case, because you then perform a looped one-to-many polygon intersection: one st_intersects() call per county instead of a single call for all of them.
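To make the difference concrete, here is a minimal sketch that runs both forms on the nc data shipped with sf. It assumes the first five counties can stand in for nc_wtr (so nothing needs to be downloaded), and it writes st_intersects_any() with base lengths() instead of map_lgl(); neither choice comes from the original post.

```r
library(sf)
library(dplyr)

# NC counties from the sf package, projected as in the question
nc <- read_sf(system.file("shape/nc.shp", package = "sf")) %>%
  st_transform(32119)

# ASSUMPTION: the first five counties stand in for nc_wtr (no download needed)
y <- nc[1:5, ]

# Same idea as the question's helper, written with base lengths()
st_intersects_any <- function(x, y) {
  lengths(st_intersects(x, y)) > 0
}

# Looped one-to-many: one st_intersects() call per county
row_wise <- vapply(
  seq_len(nrow(nc)),
  function(i) st_intersects_any(nc[i, ], y),
  logical(1)
)

# Many-to-many: a single st_intersects() call covering every county
vectorised <- mutate(nc, INT = st_intersects_any(geometry, y))$INT

# Both forms should agree; only the number of calls differs
all(row_wise == vectorised)
```

Wrapping each form in system.time() should show the gap on your own machine; it is likely to widen as the number of rows grows.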

Besides the overhead introduced by subsetting, I believe this is much slower than a straight many-to-many intersection because you are probably losing most of the benefit of the "spatial indexing" capabilities of sf objects, which considerably speed up intersection operations (see http://r-spatial.org/r/2017/06/22/spatial-index.html). (Also note that I replaced transmute with mutate; that was also introducing some overhead.)
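The same reasoning should carry over to the st_area() snippet at the top of the question: st_area() already works on a whole geometry column, so the map_dbl() wrapper is not needed. A small sketch, assuming the nc data from the sf package in place of the question's simple_feature object:

```r
library(sf)
library(dplyr)

# NC counties from the sf package, projected as in the question
nc <- read_sf(system.file("shape/nc.shp", package = "sf")) %>%
  st_transform(32119)

# st_area() is vectorised over the geometry column;
# as.double() drops the units class, as in the question's version
nc_area <- nc %>%
  mutate(geometry_area = as.double(st_area(geometry)))

head(nc_area$geometry_area)
```

This keeps the %>% and mutate workflow the question asks for while making one call per column rather than one per row.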

HTH
