Fast Haversine Approximation (Python/Pandas)

Question

Each row in a pandas DataFrame contains the lat/lng coordinates of 2 points. Using the Python code below, calculating the distance between these 2 points for many (millions of) rows takes a very long time!

Considering that the 2 points are under 50 miles apart and accuracy is not very important, is it possible to make the calculation faster?

from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km


for index, row in df.iterrows():
    df.loc[index, 'distance'] = haversine(row['a_longitude'], row['a_latitude'], row['b_longitude'], row['b_latitude'])

Recommended answer

Here is a vectorized NumPy version of the same function:

import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c  # 6367 km approximates the Earth's radius (the mean radius is about 6371 km)
    return km

The inputs are all arrays of values, and the function should handle millions of points almost instantly. It expects ndarrays, but the columns of your pandas DataFrame will work as well.

For example, with randomly generated values:

>>> import numpy as np
>>> import pandas
>>> lon1, lon2, lat1, lat2 = np.random.randn(4, 1000000)
>>> df = pandas.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2})
>>> km = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])

Or, if you want to store the result in a new column:

>>> df['distance'] = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])

Looping through arrays of data is very slow in Python. NumPy provides functions that operate on entire arrays at once, which lets you avoid the loop and drastically improve performance.
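To see the difference on your own machine, the following is a rough timing sketch rather than a rigorous benchmark. It repeats both functions from above so that it runs on its own; the sample size, coordinate ranges, and column names are arbitrary assumptions, and the loop uses `itertuples` instead of the `iterrows` call in the question.

```python
import time
from math import radians, cos, sin, asin, sqrt

import numpy as np
import pandas as pd


def haversine(lon1, lat1, lon2, lat2):
    """Scalar haversine distance in km (same formula as in the question)."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 6367 * 2 * asin(sqrt(a))


def haversine_np(lon1, lat1, lon2, lat2):
    """Vectorized haversine distance in km (same formula as in the answer)."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))


# Assumed test data: 10,000 random coordinate pairs in an arbitrary region.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "lon1": rng.uniform(-10, 10, n),
    "lat1": rng.uniform(40, 50, n),
    "lon2": rng.uniform(-10, 10, n),
    "lat2": rng.uniform(40, 50, n),
})

# Row-by-row Python loop: one scalar function call per row.
start = time.perf_counter()
looped = np.array([
    haversine(r.lon1, r.lat1, r.lon2, r.lat2)
    for r in df.itertuples(index=False)
])
loop_seconds = time.perf_counter() - start

# Vectorized call: one function call for the whole table.
start = time.perf_counter()
vectorized = haversine_np(df["lon1"], df["lat1"], df["lon2"], df["lat2"])
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
print("results agree:", np.allclose(looped, vectorized.to_numpy()))
```

The exact timings depend on hardware and library versions, so treat the printed numbers as indicative only. The final check confirms that both versions compute the same distances.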

This is an example of vectorization.
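The answer above gets its speed from vectorization while keeping the exact haversine formula. Because the question says the points are under 50 miles apart and accuracy is not critical, one further option, which is not part of the original answer, is the equirectangular approximation: treat the small patch of the sphere as flat and apply the Pythagorean theorem. The sketch below is a minimal illustration under that assumption; the name `equirectangular_np` and the sample data are hypothetical, and `haversine_np` is repeated only so the comparison runs on its own.

```python
import numpy as np

EARTH_RADIUS_KM = 6367  # same value as the functions above


def equirectangular_np(lon1, lat1, lon2, lat2):
    """
    Approximate distance in km between points given in decimal degrees.

    Treats the surface as locally flat, so it is only intended for
    short distances. Hypothetical helper, not from the original answer.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # Scale the longitude difference by the cosine of the mean latitude.
    x = (lon2 - lon1) * np.cos((lat1 + lat2) / 2.0)
    y = lat2 - lat1
    return EARTH_RADIUS_KM * np.sqrt(x ** 2 + y ** 2)


def haversine_np(lon1, lat1, lon2, lat2):
    """Vectorized haversine distance in km (same formula as in the answer)."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return EARTH_RADIUS_KM * 2 * np.arcsin(np.sqrt(a))


# Assumed sample: start points in an arbitrary region, with end points
# offset by at most 0.5 degrees in each direction (roughly under 50 miles).
rng = np.random.default_rng(0)
n = 100_000
lon1 = rng.uniform(-120, -70, n)
lat1 = rng.uniform(25, 50, n)
lon2 = lon1 + rng.uniform(-0.5, 0.5, n)
lat2 = lat1 + rng.uniform(-0.5, 0.5, n)

exact = haversine_np(lon1, lat1, lon2, lat2)
approx = equirectangular_np(lon1, lat1, lon2, lat2)
max_abs_err_km = float(np.max(np.abs(approx - exact)))

print(f"largest difference from haversine: {max_abs_err_km:.6f} km")
```

The approximation needs a single cosine per row instead of several trigonometric calls, so it may run somewhat faster, but the gain is worth timing on your own data. It also does not handle pairs that straddle the ±180° meridian, and its error grows with distance and near the poles.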
