Distance between coordinates: Python vs R computation time


Problem description

I am trying to calculate the distance between one point and many others on the WGS84 ellipsoid, not the haversine approximation explained in other answers. I would like to do it in Python, but the computation time is very long compared with R. My Python script below takes almost 23 seconds, while the equivalent one in R takes 0.13 seconds. Any suggestions for speeding up my Python code?

Python script:

import numpy as np
import pandas as pd
from geopy.distance import geodesic
from timeit import default_timer as timer

df = pd.DataFrame()
# origin given as (lon, lat); geopy expects (lat, lon), hence the reversal
city_coord_orig = (4.351749, 50.845701)
city_coord_orig_r = tuple(reversed(city_coord_orig))
N = 100000
# one origin tuple per row, plus a randomly perturbed copy as the destination
df['or'] = [city_coord_orig_r] * N
df['new'] = df.apply(lambda x: (x['or'][0] + np.random.normal(), x['or'][1] + np.random.normal()), axis=1)
# time the row-by-row geodesic distance calculation
start = timer()
df['d2city2'] = df.apply(lambda x: geodesic(x['or'], x['new']).km, axis=1)
end = timer()
print(end - start)

R script:

# clean up
rm(list = ls())
# read libraries
library(geosphere)

city.coord.orig <- c(4.351749, 50.845701)
N<-100000
many <- data.frame(x=rep(city.coord.orig[1], N) + rnorm(N), 
                   y=rep(city.coord.orig[2], N) + rnorm(N))
# time the vectorised geodesic distance calculation (distGeo returns metres)
start_time <- Sys.time()
many$d2city <- distGeo(city.coord.orig, many[,c("x","y")]) 
end_time <- Sys.time()
end_time - start_time

Recommended answer

You are using .apply(), which uses a simple loop to run your function for each and every row. The distance calculation is done entirely in Python (geopy uses geographiclib, which appears to be written in Python only). Non-vectorised distance calculations are slow. What you need is a vectorised solution using compiled code, just like when calculating the Haversine distance.
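To make the loop-versus-vectorised distinction concrete, here is a minimal sketch using the Haversine formula mentioned above. It is only an illustration of the principle: Haversine is a spherical approximation, not the ellipsoidal distance the question asks for, and the function names and the mean Earth radius constant are assumptions of this sketch rather than anything from the original code.

```python
import math
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_loop(lat1, lon1, lats2, lons2):
    """Row-by-row version: the formula is evaluated once per point in Python."""
    out = []
    p1 = math.radians(lat1)
    for lat2, lon2 in zip(lats2, lons2):
        p2 = math.radians(lat2)
        dlmb = math.radians(lon2 - lon1)
        a = (math.sin((p2 - p1) / 2) ** 2
             + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
        out.append(2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a)))
    return out


def haversine_vectorised(lat1, lon1, lats2, lons2):
    """Vectorised version: each step operates on whole NumPy arrays at once."""
    p1 = np.radians(lat1)
    p2 = np.radians(np.asarray(lats2, dtype=float))
    dlmb = np.radians(np.asarray(lons2, dtype=float) - lon1)
    a = np.sin((p2 - p1) / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Same kind of data as in the question: random points around one origin.
rng = np.random.default_rng(0)
lats = 50.845701 + rng.normal(size=1000)
lons = 4.351749 + rng.normal(size=1000)

d_loop = haversine_loop(50.845701, 4.351749, lats, lons)
d_vec = haversine_vectorised(50.845701, 4.351749, lats, lons)
```

Both functions compute the same numbers; the difference is that the second one hands the per-element work to NumPy's compiled routines instead of the Python interpreter.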

pyproj offers vectorised WGS84 distance calculations (the methods of the pyproj.Geod class accept numpy arrays) and wraps the PROJ4 library, meaning it runs these calculations in native machine code:

from pyproj import Geod

# split out coordinates into separate columns
df[['or_lat', 'or_lon']] = pd.DataFrame(df['or'].tolist(), index=df.index)
df[['new_lat', 'new_lon']] = pd.DataFrame(df['new'].tolist(), index=df.index)

wsg84 = Geod(ellps='WGS84')
# numpy matrix of the lon / lat columns, iterable in column order
or_and_new = df[['or_lon', 'or_lat', 'new_lon', 'new_lat']].to_numpy().T
df['d2city2'] = wsg84.inv(*or_and_new)[-1] / 1000  # as km
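If the coordinates do not have to live in tuple columns first, the same call can be fed plain NumPy arrays directly. The following is a sketch under that assumption; the variable names and the seeded random generator are illustrative and not part of the original answer.

```python
import numpy as np
from pyproj import Geod

geod = Geod(ellps="WGS84")

origin_lon, origin_lat = 4.351749, 50.845701
n = 100_000

# Random destination points around the origin, as in the question.
rng = np.random.default_rng(42)
new_lons = origin_lon + rng.normal(size=n)
new_lats = origin_lat + rng.normal(size=n)

# Repeat the origin so all four inputs have the same length.
or_lons = np.full(n, origin_lon)
or_lats = np.full(n, origin_lat)

# Geod.inv takes (lons1, lats1, lons2, lats2) and returns
# forward azimuths, back azimuths and distances in metres.
_, _, dist_m = geod.inv(or_lons, or_lats, new_lons, new_lats)
dist_km = np.asarray(dist_m) / 1000
```

Note the argument order: pyproj expects longitude first, whereas geopy's tuples are (latitude, longitude).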

This comes in at much better timings:

>>> from timeit import Timer
>>> count, total = Timer(
...     "wsg84.inv(*df[['or_lon', 'or_lat', 'new_lon', 'new_lat']].to_numpy().T)[-1] / 1000",
...     'from __main__ import wsg84, df'
... ).autorange()
>>> total / count * 10 ** 3  # milliseconds
66.09873340003105

66 milliseconds to calculate 100k distances, not bad!

To make the comparison objective, here is your geopy / df.apply() version on the same machine:

>>> count, total = Timer("df.apply(lambda x: geodesic(x['or'], x['new']).km, axis=1)", 'from __main__ import geodesic, df').autorange()
>>> total / count * 10 ** 3  # milliseconds
25844.119450000107

25.8 seconds, not even in the same ballpark.
