有效计算Haversine距离的最小值 [英] Efficient computation of minimum of Haversine distances
问题描述
我有一个带有 >2.7MM 坐标的数据框,以及一个包含 ~2,000 个坐标的单独列表.与列表中的每个坐标相比,我试图返回每一行中坐标之间的最小距离.以下代码适用于小规模(具有 200 行的数据帧),但是当计算超过 2.7MM 的行时,它似乎永远运行.
I have a dataframe with >2.7MM coordinates, and a separate list of ~2,000 coordinates. I'm trying to return the minimum distance between the coordinates in each individual row compared to every coordinate in the list. The following code works on a small scale (dataframe with 200 rows), but when calculating over 2.7MM rows, it seemingly runs forever.
from haversine import haversine
df
Latitude Longitude
39.989 -89.980
39.923 -89.901
39.990 -89.987
39.884 -89.943
39.030 -89.931
end_coords_list = [(41.342,-90.423),(40.349,-91.394),(38.928,-89.323)]
for row in df.itertuples():
def min_distance(row):
beg_coord = (row.Latitude, row.Longitude)
return min(haversine(beg_coord, end_coord) for end_coord in end_coords_list)
df['Min_Distance'] = df.apply(min_distance, axis=1)
我知道问题在于正在发生的计算数量过多(5.7MM * 2,000 = ~11.4BN),而且运行这么多循环的效率非常低.
I know the issue lies in the sheer number of calculations that are happening (5.7MM * 2,000 = ~11.4BN), and the fact that running this many loops is incredibly inefficient.
根据我的研究,似乎矢量化 NumPy 函数可能是更好的方法,但我是 Python 和 NumPy 的新手,所以我不太确定如何在这种特殊情况下实现这一点.
Based on my research, it seems like a vectorized NumPy function might be a better approach, but I'm new to Python and NumPy so I'm not quite sure how to implement this in this particular situation.
理想输出:
df
Latitude Longitude Min_Distance
39.989 -89.980 3.7
39.923 -89.901 4.1
39.990 -89.987 4.2
39.884 -89.943 5.9
39.030 -89.931 3.1
提前致谢!
推荐答案
haversine func
本质上是:
# convert all latitudes/longitudes from decimal degrees to radians
lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
# calculate haversine
lat = lat2 - lat1
lng = lng2 - lng1
d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
h = 2 * AVG_EARTH_RADIUS * asin(sqrt(d))
这是一种利用强大的的矢量化方法NumPy 广播
和 NumPy ufuncs
来替换那些数学模块函数,以便我们一次性操作整个数组 -
Here's a vectorized method leveraging the powerful NumPy broadcasting
and NumPy ufuncs
to replace those math-module funcs so that we would operate on entire arrays in one go -
# Get array data; convert to radians to simulate 'map(radians,...)' part
coords_arr = np.deg2rad(coords_list)
a = np.deg2rad(df.values)
# Get the differentiations
lat = coords_arr[:,0] - a[:,0,None]
lng = coords_arr[:,1] - a[:,1,None]
# Compute the "cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2" part.
# Add into "sin(lat * 0.5) ** 2" part.
add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
d = np.sin(lat * 0.5) ** 2 + add0
# Get h and assign into dataframe
h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
df['Min_Distance'] = h.min(1)
为了进一步提升性能,我们可以使用 numexpr代码>模块
来替换超越函数.
For further performance boost, we can make use of numexpr
module to replace the transcendental funcs.
运行时测试和验证
方法 -
def loopy_app(df, coords_list):
for row in df.itertuples():
df['Min_Distance1'] = df.apply(min_distance, axis=1)
def vectorized_app(df, coords_list):
coords_arr = np.deg2rad(coords_list)
a = np.deg2rad(df.values)
lat = coords_arr[:,0] - a[:,0,None]
lng = coords_arr[:,1] - a[:,1,None]
add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
d = np.sin(lat * 0.5) ** 2 + add0
h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
df['Min_Distance2'] = h.min(1)
验证 -
In [158]: df
Out[158]:
Latitude Longitude
0 39.989 -89.980
1 39.923 -89.901
2 39.990 -89.987
3 39.884 -89.943
4 39.030 -89.931
In [159]: loopy_app(df, coords_list)
In [160]: vectorized_app(df, coords_list)
In [161]: df
Out[161]:
Latitude Longitude Min_Distance1 Min_Distance2
0 39.989 -89.980 126.637607 126.637607
1 39.923 -89.901 121.266241 121.266241
2 39.990 -89.987 126.037388 126.037388
3 39.884 -89.943 118.901195 118.901195
4 39.030 -89.931 53.765506 53.765506
时间 -
In [163]: df
Out[163]:
Latitude Longitude
0 39.989 -89.980
1 39.923 -89.901
2 39.990 -89.987
3 39.884 -89.943
4 39.030 -89.931
In [164]: %timeit loopy_app(df, coords_list)
100 loops, best of 3: 2.41 ms per loop
In [165]: %timeit vectorized_app(df, coords_list)
10000 loops, best of 3: 96.8 µs per loop
这篇关于有效计算Haversine距离的最小值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!