快速Haversine近似(Python/Pandas) [英] Fast Haversine Approximation (Python/Pandas)
问题描述
Pandas数据框中的每一行都包含2点的经/纬度坐标.使用下面的Python代码,计算许多(几百万)行的这两个点之间的距离会花费很长时间!
Each row in a Pandas dataframe contains lat/lng coordinates of 2 points. Using the Python code below, calculating the distances between these 2 points for many (millions) of rows takes a very long time!
考虑到两个点相距50英里以内,准确性不是很重要,是否可以使计算速度更快?
Considering that the 2 points are under 50 miles apart and accuracy is not very important, is it possible to make the calculation faster?
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
return km
for index, row in df.iterrows():
df.loc[index, 'distance'] = haversine(row['a_longitude'], row['a_latitude'], row['b_longitude'], row['b_latitude'])
推荐答案
以下是该函数的矢量化numpy版本:
Here is a vectorized numpy version of the same function:
import numpy as np
def haversine_np(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
All args must be of equal length.
"""
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
c = 2 * np.arcsin(np.sqrt(a))
km = 6367 * c
return km
输入都是值的数组,它应该能够立即完成数百万个点.要求输入是ndarrays,但您的pandas表中的列将起作用.
The inputs are all arrays of values, and it should be able to do millions of points instantly. The requirement is that the inputs are ndarrays but the columns of your pandas table will work.
例如,使用随机生成的值:
For example, with randomly generated values:
>>> import numpy as np
>>> import pandas
>>> lon1, lon2, lat1, lat2 = np.random.randn(4, 1000000)
>>> df = pandas.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2})
>>> km = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])
或者如果您要创建另一列:
Or if you want to create another column:
>>> df['distance'] = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])
在python中遍历数据数组非常慢. Numpy提供的功能可对整个数据数组进行操作,从而使您避免循环并大大提高性能.
Looping through arrays of data is very slow in python. Numpy provides functions that operate on entire arrays of data, which lets you avoid looping and drastically improve performance.
这是向量化的示例.
这篇关于快速Haversine近似(Python/Pandas)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!