How to efficiently calculate distance between GPS points in one dataset and GPS points in another data set using data.table

Problem description

I am facing a coding (optimization) problem in R. I have a long data set with GPS coordinates (lon, lat, timestamp) and for every row I need to check whether the location is near a bus stop. I have a .csv file with all the bus stops (in the Netherlands). The GPS coordinates file is millions of entries long, but could be split if necessary. The bus stop dataset is around 5500 entries long.
Using the code and tips given on, inter alia, these pages:

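1) How to efficiently calculate distance between pair of coordinates using data.table :=

2) Using a simple for loop on spatial data

3) Calculate distance between two latitude-longitude points? (Haversine formula)

4) Fastest way to determine country from millions of GPS coordinates [R]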

I was able to write code that works, but it is (too) slow. Can someone help me with a faster data.table implementation, or point out where the bottleneck in my code is? Is it the spDistsN1() function, or maybe the combination of apply() and melt()? I am most comfortable in R, but open to other software (as long as it is open source).

Due to privacy concerns I cannot upload the full dataset, but this is a (small) reproducible example that is not too different from how the real data looks.

# packages:
library(data.table)
library(tidyverse)
library(sp)
library(reshape2) # melt() for lists, used below on the apply() result


# create GPS data
number_of_GPS_coordinates <- 20000
set.seed(1)
gpsdata<-as.data.frame(cbind(id=1:number_of_GPS_coordinates, 
                             lat=runif(number_of_GPS_coordinates,50.5,53.5), 
                             lon=runif(number_of_GPS_coordinates,4,7)))

# create some busstop data. In this case only 2000 bus stops
set.seed(1)
number_of_bus_stops <- 2000
stop<-as.data.frame(gpsdata[sample(nrow(gpsdata), number_of_bus_stops), -1]) # of course do not keep id variable
stop$lat<-stop$lat+rnorm(number_of_bus_stops,0,.0005)
stop$lon<-stop$lon+rnorm(number_of_bus_stops,0,.0005)
busdata.data<-cbind(stop, name=replicate(number_of_bus_stops, paste(sample(LETTERS, 15, replace=TRUE), collapse="")))

names(busdata.data) <- c("latitude_bustops",  "longitude_bustops", "name")

You can download the real bus stop data if you want; a random sample of it is hard to reproduce.

#temp <- tempfile()
#download.file("http://data.openov.nl/haltes/stops.csv.gz", temp) #1.7MB
#gzfile(temp, 'rt')
#busstopdata <- read.csv(temp, stringsAsFactors = FALSE)
#unlink(temp)
#bus_stops <- fread("bus_stops.csv")
#busdata.data <- busstopdata %>%
#  mutate(latitude_bustops = latitude)%>%
#  mutate(longitude_bustops = longitude)%>%
#  dplyr::select(name, latitude_bustops,  longitude_bustops)

This is the code I use now to calculate distances. It works, but it is pretty slow:

countDataPoints3 <- function(p) {
  distances <- spDistsN1(data.matrix(gpsdata[,c("lon","lat")]), 
                         p,
                         longlat=TRUE) # in km
  return(which(distances <= .2)) # distance is now set to 200 meters
}


# code to check per data point if a bus stop is near and save this per bus stop in a list entry
datapoints.by.bustation       <- apply(data.matrix(busdata.data[,c("longitude_bustops","latitude_bustops")]), 1, countDataPoints3)


# rename list entries
names(datapoints.by.bustation) <- busdata.data$name

# melt list into one big data.frame
long.data.frame.busstops       <- melt(datapoints.by.bustation)

# now switch to data.table grammar to speed up process
# set data.table
setDT(gpsdata)
gpsdata[, rowID := 1:nrow(gpsdata)]
setkey(gpsdata, key = "rowID")
setDT(long.data.frame.busstops)

# merge the data, and filter non-unique entries 
setkey(long.data.frame.busstops, key = "value")
GPS.joined        <- merge(x = gpsdata, y = long.data.frame.busstops, by.x= "rowID", by.y= "value", all.x=TRUE)
GPS.joined.unique <- unique(GPS.joined, by="id") # mak

# this last part of the code is needed to make sure that if there are more than 1 bus stop nearby it puts these bus stop in a list
# instead of adding row and making the final data.frame longer than the original one
GPS.joined.unique2 <- setDT(GPS.joined.unique)[order(id, L1), list(L1=list(L1)), by=id]
GPS.joined.unique2[, nearby := TRUE][is.na(L1), nearby := FALSE] # add a dummy to check if any bus stop is nearby.

# makes sense:
as_tibble(GPS.joined.unique2) %>%
  summarize(sum = sum(nearby)) 


Recommended answer

Consider a slicing approach: first cut the candidates down to close latitudes and close longitudes, and only then compute exact distances. In this case the cut is 0.5 degrees of latitude and 0.5 degrees of longitude (which is still about a 60 km disc). We can use data.table's superb support for rolling joins.
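The same prefiltering idea can also be sketched with a non-equi join rather than a rolling join. This is only an illustration under stated assumptions, not the answer's method: the toy tables `gps` and `stops`, the helper `haversine_km`, and the rough figure of 110 km per degree are all hypothetical. It draws a small latitude/longitude box around each GPS point, joins only the stops that fall inside that box, and then applies the exact distance test to the surviving pairs.

```r
library(data.table)

# Hypothetical toy data: three GPS points and two bus stops
gps <- data.table(id  = 1:3,
                  lat = c(52.0000, 52.0010, 53.0000),
                  lon = c(5.0000, 5.0000, 6.0000))
stops <- data.table(name     = c("A", "B"),
                    stop_lat = c(52.0005, 51.0000),
                    stop_lon = c(5.0005, 4.5000))

# Great-circle distance in km (haversine formula, mean Earth radius 6371 km)
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dphi <- (lat2 - lat1) * to_rad
  dlam <- (lon2 - lon1) * to_rad
  a <- sin(dphi / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlam / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

radius_km <- 0.2

# Assumption: 110 km per degree is a deliberately low estimate, so the box
# errs on the side of being too large; longitude degrees shrink with cos(lat)
dlat <- radius_km / 110
dlon <- radius_km / (110 * cos(max(abs(gps$lat)) * pi / 180))

gps[, `:=`(lat_lo = lat - dlat, lat_hi = lat + dlat,
           lon_lo = lon - dlon, lon_hi = lon + dlon)]

# Non-equi join: keep only (point, stop) pairs where the stop is inside the box
cand <- stops[gps,
              on = .(stop_lat >= lat_lo, stop_lat <= lat_hi,
                     stop_lon >= lon_lo, stop_lon <= lon_hi),
              nomatch = 0L,
              .(id, name,
                gps_lat = i.lat, gps_lon = i.lon,
                stop_lat = x.stop_lat, stop_lon = x.stop_lon)]

# Exact test on the (much smaller) candidate set
cand[, distance := haversine_km(gps_lat, gps_lon, stop_lat, stop_lon)]
near <- cand[distance <= radius_km]

# Flag every GPS point that has at least one stop within the radius
gps[, nearby := id %in% near$id]
```

Because the box is sized to the search radius, the join returns every stop inside it for each point, so the exact haversine test only runs on a small set of candidate pairs.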

The following takes a few milliseconds for 20,000 entries and only a few seconds for 2M entries.

library(data.table)
library(hutils)
setDT(gpsdata)
setDT(busdata.data)

gps_orig <- copy(gpsdata)
busdata.orig <- copy(busdata.data)

setkey(gpsdata, lat)

# Just to take note of the originals
gpsdata[, gps_lat := lat + 0]
gpsdata[, gps_lon := lon + 0]

busdata.data[, lat := latitude_bustops + 0]
busdata.data[, lon := longitude_bustops + 0]


setkey(busdata.data, lat)

gpsID_by_lat <- 
  gpsdata[, .(id), keyby = "lat"]


By_latitude <- 
  busdata.data[gpsdata, 
               on = "lat",

               # within 0.5 degrees of latitude
               roll = 0.5, 
               # +/-
               rollends = c(TRUE, TRUE),

               # and remove those beyond 0.5 degrees
               nomatch=0L] %>%
  .[, .(id_lat = id,
        name_lat = name,
        bus_lat = latitude_bustops,
        bus_lon = longitude_bustops,
        gps_lat,
        gps_lon),
    keyby = .(lon = gps_lon)]

setkey(busdata.data, lon)

By_latlon <-
  busdata.data[By_latitude,
               on = c("name==name_lat", "lon"),

               # within 0.5 degrees of longitude
               roll = 0.5, 
               # +/-
               rollends = c(TRUE, TRUE),
               # and remove those beyond 0.5 degrees
               nomatch=0L]

By_latlon[, distance := haversine_distance(lat1 = gps_lat, 
                                           lon1 = gps_lon,
                                           lat2 = bus_lat,
                                           lon2 = bus_lon)]

By_latlon[distance < 0.2]
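
The question also asks for one row per GPS point, with the nearby stops gathered in a list and a flag saying whether any stop is close. A minimal sketch of that last step follows, assuming a table with one row per candidate pair, like `By_latlon`. The toy values and the names `pairs`, `all_ids` and `out` are hypothetical, and the id column is called `id` here (it is `id_lat` in `By_latlon`).

```r
library(data.table)

# Hypothetical candidate pairs: GPS point id, stop name, distance in km
pairs <- data.table(id       = c(1L, 1L, 2L),
                    name     = c("A", "B", "A"),
                    distance = c(0.05, 0.15, 0.30))

# All GPS point ids, including points with no candidate stop at all
all_ids <- data.table(id = 1:3)

# Keep pairs within 200 m and collapse to one row per GPS point
near <- pairs[distance < 0.2,
              .(stops = list(name), n_stops = .N),
              by = id]

# Left join back so every GPS point keeps exactly one row
out <- merge(all_ids, near, by = "id", all.x = TRUE)
out[, nearby := !is.na(n_stops)]
```

Collapsing before the merge keeps the result the same length as the original GPS table, which is what the `unique()` and `list()` steps in the question were working towards.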
