How can I speed up spatial operations in `dplyr::mutate()`?
I am working on a spatial problem using the `sf` package in conjunction with `dplyr` and `purrr`.
I would prefer to perform spatial operations inside a `mutate` call, like so:
simple_feature %>%
  mutate(geometry_area = map_dbl(geometry, ~ as.double(st_area(.x))))
I like that this approach allows me to run a series of spatial operations using `%>%` and `mutate`.
I dislike that this approach seems to significantly increase the run-time of the `sf` functions (sometimes prohibitively), and I would appreciate hearing suggestions about how to overcome this speed loss.
Here is a reprex that illustrates the speed loss problem in detail.
Please note: this is not a minimal example and requires downloading a few packages and one file from an ESRI REST API. I hope you'll be kind with me ;)
The objective in this example is to add a new column indicating whether each North Carolina county (`nc`) intersects with any of the waterbodies polygons (`nc_wtr`), as shown in the image below:
I created a function that performs this calculation: `st_intersects_any()`
Then I benchmark that function on two datasets (`nc` and `nc_1e4`), first using `st_intersects_any()` by itself and then using it inside a `mutate` call.
## |TEST | ELAPSED|
## |:------------------|--------:|
## |bm_sf_small | 0.01|
## |bm_sf_dplyr_small | 1.22|
## |bm_sf_large | 0.95|
## |bm_sf_dplyr_large | 122.88|
The benchmarks clearly show that the `dplyr` approach is substantially slower, and I'm hoping that someone has a suggestion for reducing or eliminating this speed loss while still using the `dplyr` approach.
If there are significantly faster ways to do this using `data.table` or some other method that I should check out, please let me know about those as well.
Thanks!
Reprex
# Setup ----
library(lwgeom) # devtools::install_github('r-spatial/lwgeom')
library(tidyverse)
library(sf)
library(esri2sf) # devtools::install_github('yonghah/esri2sf')
library(rbenchmark)
library(knitr)
# Create the new sf function: st_intersects_any ----
st_intersects_any <- function(x, y) {
  st_intersects(x, y) %>%
    map_lgl(~ length(.x) > 0)
}
# Load data ----
# NC counties
nc <- read_sf(system.file("shape/nc.shp", package = "sf")) %>%
  st_transform(32119)
nc_1e4 <- list(nc) %>%
  rep(times = 1e2) %>%
  reduce(rbind)
# NC watersheds
url <- "https://services.nconemap.gov/secure/rest/services/NC1Map_Watersheds/MapServer/2"
nc_wtr <- esri2sf(url)
## Warning: package 'httr' was built under R version 3.4.2
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
## [1] "Feature Layer"
## [1] "esriGeometryPolygon"
nc_wtr <- st_transform(nc_wtr, 32119) %>%
  st_simplify(dTolerance = 100) # simplify the waterbodies geometries
# plot the data
par(mar = rep(.1, 4))
plot(st_geometry(nc), lwd = 1)
plot(st_geometry(nc_wtr), col = alpha("blue", .3), lwd = 1.5, add = TRUE)
# Benchmark the two approaches
cols <- c("elapsed", "relative")
bm_sf_small <- benchmark({
  st_intersects_any(nc, nc_wtr)
}, columns = cols, replications = 1)
bm_sf_dplyr_small <- benchmark({
  nc %>% transmute(INT = map_lgl(geometry, st_intersects_any, y = nc_wtr))
}, columns = cols, replications = 1)
## Warning: package 'bindrcpp' was built under R version 3.4.2
bm_sf_large <- benchmark({
  st_intersects_any(nc_1e4, nc_wtr)
}, columns = cols, replications = 1)
bm_sf_dplyr_large <- benchmark({
  nc_1e4 %>% transmute(INT = map_lgl(geometry, st_intersects_any, y = nc_wtr))
}, columns = cols, replications = 1)
tests <- list(bm_sf_small, bm_sf_dplyr_small, bm_sf_large, bm_sf_dplyr_large)
tbl <- tibble(
  TEST = c("bm_sf_small", "bm_sf_dplyr_small", "bm_sf_large", "bm_sf_dplyr_large"),
  ELAPSED = map_dbl(tests, "elapsed")
)
kable(tbl, format = "markdown", padding = 2)
## |TEST | ELAPSED|
## |:------------------|--------:|
## |bm_sf_small | 0.01|
## |bm_sf_dplyr_small | 1.22|
## |bm_sf_large | 0.95|
## |bm_sf_dplyr_large | 122.88|
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.0 (2017-04-21)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## tz America/Los_Angeles
## date 2018-01-31
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.2)
## backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
## base * 3.4.0 2017-04-21 local
## bindr 0.1 2016-11-13 CRAN (R 3.4.2)
## bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.2)
## broom 0.4.3 2017-11-20 CRAN (R 3.4.3)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.2)
## class 7.3-14 2015-08-30 CRAN (R 3.4.0)
## classInt 0.1-24 2017-04-16 CRAN (R 3.4.2)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.2)
## compiler 3.4.0 2017-04-21 local
## crayon 1.3.4 2017-10-30 Github (r-lib/crayon@b5221ab)
## curl 3.0 2017-10-06 CRAN (R 3.4.2)
## datasets * 3.4.0 2017-04-21 local
## DBI 0.7 2017-06-18 CRAN (R 3.4.2)
## devtools 1.13.2 2017-06-02 CRAN (R 3.4.0)
## digest 0.6.13 2017-12-14 CRAN (R 3.4.3)
## dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.2)
## e1071 1.6-8 2017-02-02 CRAN (R 3.4.2)
## esri2sf * 0.1.0 2017-12-12 Github (yonghah/esri2sf@81d211f)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.3)
## forcats * 0.2.0 2017-01-23 CRAN (R 3.4.3)
## foreign 0.8-67 2016-09-13 CRAN (R 3.4.0)
## ggplot2 * 2.2.1.9000 2017-12-02 Github (tidyverse/ggplot2@7b5c185)
## glue 1.2.0.9000 2018-01-13 Github (tidyverse/glue@1592ee1)
## graphics * 3.4.0 2017-04-21 local
## grDevices * 3.4.0 2017-04-21 local
## grid 3.4.0 2017-04-21 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.2)
## haven 1.1.0 2017-07-09 CRAN (R 3.4.2)
## hms 0.4.0 2017-11-23 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr * 1.3.1 2017-08-20 CRAN (R 3.4.2)
## jsonlite * 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.18 2017-12-27 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.0)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.1 2017-11-03 CRAN (R 3.4.2)
## lwgeom * 0.1-1 2017-12-16 Github (r-spatial/lwgeom@baf22c6)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.0 2017-04-21 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.1)
## modelr 0.1.1 2017-07-24 CRAN (R 3.4.2)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.2)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
## parallel 3.4.0 2017-04-21 local
## pillar 1.0.99.9001 2018-01-16 Github (r-lib/pillar@9d96835)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.2)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.2)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.2)
## purrr * 0.2.4.9000 2017-12-05 Github (tidyverse/purrr@62b135a)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## rbenchmark * 1.0.0 2012-08-30 CRAN (R 3.4.1)
## Rcpp 0.12.15 2018-01-20 CRAN (R 3.4.3)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.2)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.2)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.2)
## rlang 0.1.6 2017-12-21 CRAN (R 3.4.3)
## rmarkdown 1.8 2017-11-17 CRAN (R 3.4.2)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.2)
## scales 0.5.0.9000 2017-12-02 Github (hadley/scales@d767915)
## sf * 0.6-1 2018-01-24 Github (r-spatial/sf@7ea67a5)
## stats * 3.4.0 2017-04-21 local
## stringi 1.1.6 2017-11-17 CRAN (R 3.4.2)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.4.0)
## tibble * 1.4.1.9000 2018-01-18 Github (tidyverse/tibble@64fedbd)
## tidyr * 0.7.2.9000 2018-01-13 Github (tidyverse/tidyr@74bd48f)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.3)
## tools 3.4.0 2017-04-21 local
## udunits2 0.13 2016-11-17 CRAN (R 3.4.1)
## units 0.5-1 2018-01-08 CRAN (R 3.4.3)
## utf8 1.1.3 2018-01-03 CRAN (R 3.4.3)
## utils * 3.4.0 2017-04-21 local
## withr 2.1.1.9000 2018-01-13 Github (jimhester/withr@df18523)
## xml2 1.1.1 2017-01-24 CRAN (R 3.4.2)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
You can speed this up considerably by simply dropping the unnecessary `map_lgl` call in the pipe:
bm_sf_dplyr_large_fast <- benchmark({
  int_new <- nc_1e4 %>% mutate(INT = st_intersects_any(., nc_wtr))
}, columns = cols, replications = 1)
bm_sf_dplyr_large_fast
# bm_sf_dplyr_large_fast
# elapsed relative
# 1 0.829 1
The huge slowdown comes from the fact that mapping over geometry rows is detrimental in this case, because you then perform a looped one-to-many polygon intersection.
Besides the overhead introduced by subsetting, I believe this is much slower than a straight many-to-many intersection because you are probably losing most of the "spatial indexing" capabilities of `sf` objects, which considerably speed up intersect operations (see http://r-spatial.org/r/2017/06/22/spatial-index.html). (Also note that I replaced `transmute` with `mutate`, since `transmute` was also introducing some overhead.)
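To make the difference concrete, here is a minimal sketch of the vectorised pattern. It is an illustration rather than a rerun of the benchmark above. It uses only the `nc` shapefile shipped with `sf`, and `wtr` is a hypothetical stand-in for `nc_wtr` (20 km buffers around five county centroids), so nothing has to be downloaded. It also assumes the geometry column is named `geometry`, as it is for this dataset, and it uses `lengths()` in place of `map_lgl()` inside the helper.

```r
library(sf)     # spatial classes and predicates
library(dplyr)  # mutate() and %>%

# NC counties shipped with sf, projected to a metre-based CRS as in the reprex
nc <- read_sf(system.file("shape/nc.shp", package = "sf")) %>%
  st_transform(32119)

# Hypothetical stand-in for nc_wtr: 20 km buffers around five county centroids
wtr <- st_buffer(st_centroid(st_geometry(nc)[1:5]), dist = 20000)

# Vectorised: the whole geometry column goes into a single sf call,
# so the many-to-many intersection is computed in one pass
nc_vec <- nc %>%
  mutate(
    INT  = lengths(st_intersects(geometry, wtr)) > 0,
    AREA = as.double(st_area(geometry))
  )

# Row-by-row version, shown only to check that the results agree
int_loop <- vapply(
  seq_len(nrow(nc)),
  function(i) lengths(st_intersects(nc[i, ], wtr)) > 0,
  logical(1)
)

stopifnot(all(nc_vec$INT == int_loop))
```

The same idea applies to the `st_area()` example in the question: `st_area()` already accepts a whole geometry column, so wrapping it in `map_dbl()` should not be needed.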
HTH