Vectorizing Haversine distance calculation in Python


Question

I am trying to calculate a distance matrix for a long list of locations identified by latitude and longitude. I am using a Haversine function that takes two coordinate pairs, each a (latitude, longitude) tuple, and returns the distance between them:

def haversine(point1, point2, miles=False):
    """ Calculate the great-circle distance between two points on the Earth's surface.

    :input: two 2-tuples, containing the latitude and longitude of each point
    in decimal degrees.

    Example: haversine((45.7597, 4.8422), (48.8567, 2.3508))

    :output: Returns the distance between the two points.
    The default unit is kilometers. Miles can be returned
    if the ``miles`` parameter is set to True.

    """
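
The question shows only the signature and docstring. For readers who want something runnable, here is a minimal sketch of a body consistent with that docstring. The constants `EARTH_RADIUS_KM` (a mean Earth radius of 6371 km) and `KM_TO_MILES` are assumptions made for illustration, not values taken from the original post.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius in kilometers
KM_TO_MILES = 0.621371    # assumption: conversion factor used for miles=True


def haversine(point1, point2, miles=False):
    """Great-circle distance between two (lat, lng) tuples in decimal degrees."""
    # Convert both points from degrees to radians.
    lat1, lng1 = map(radians, point1)
    lat2, lng2 = map(radians, point2)
    # Haversine formula: a is the squared half-chord length between the points.
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    km = 2 * EARTH_RADIUS_KM * asin(sqrt(a))
    return km * KM_TO_MILES if miles else km


# Lyon to Paris, the example from the docstring: roughly 392 km.
d = haversine((45.7597, 4.8422), (48.8567, 2.3508))
```

The function works on one pair of points at a time, which is why computing a full distance matrix with it requires some form of looping or vectorization.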

I can calculate the distance between all points using a nested for loop as follows:

data.head()

   id                      coordinates
0   1   (16.3457688674, 6.30354512503)
1   2    (12.494749307, 28.6263955635)
2   3    (27.794615136, 60.0324947881)
3   4   (44.4269923769, 110.114216113)
4   5  (-69.8540884125, 87.9468778773)

using a simple function:

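import numpy as np
import pandas as pd

def haver_loop(df):
    # Build the distance matrix row by row: n * n calls to haversine().
    distance = {}  # local, so repeated calls do not share state
    for i, point1 in df.iterrows():
        distance[i] = []
        for j, point2 in df.iterrows():
            distance[i].append(haversine(point1.coordinates, point2.coordinates))

    return pd.DataFrame.from_dict(distance, orient='index')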

But this takes quite a while given the quadratic time complexity: around 20 s for 500 points, and I have a much longer list. This has me looking at vectorization, and I've come across numpy.vectorize, but I can't figure out how to apply it in this context.

Recommended answer

You would pass your function to np.vectorize(), and could then call the vectorized version inside pandas groupby.apply, as illustrated below:

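# np.vectorize lets the scalar haversine() broadcast over arrays; int16 output drops fractions of a km.
haver_vec = np.vectorize(haversine, otypes=[np.int16])
# For each id, compute the distances from that point to every point in df.
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))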
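
To see why each group produces one row of the matrix, it may help to look at how np.vectorize broadcasts its arguments. The sketch below uses a toy scalar function, `abs_diff`, which is a hypothetical stand-in for `haversine` and not part of the original answer.

```python
import numpy as np


def abs_diff(a, b):
    # Toy scalar "distance": works on one pair of numbers at a time.
    return abs(a - b)


abs_diff_vec = np.vectorize(abs_diff, otypes=[float])

values = np.array([1.0, 4.0, 9.0])

# A length-n array against a length-1 array broadcasts to length n:
# one "row" of distances from the single value to every value.
row = abs_diff_vec(values, np.array([4.0]))

# A column against a row broadcasts to the full n x n matrix in one call.
matrix = abs_diff_vec(values[:, None], values[None, :])
```

In the answer, `x.coordinates` plays the role of the length-1 array and `df.coordinates` the length-n array, so each group yields the distances from one point to all points.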

For instance, with sample data as follows:

length = 500
lats = np.random.uniform(-90, 90, length)
lngs = np.random.uniform(-180, 180, length)
df = pd.DataFrame({'id': np.arange(length), 'coordinates': tuple(zip(lats, lngs))})

Compare for 500 points:

def haver_vect(data):
    distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
    return distance

%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop

%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop
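
Note that np.vectorize is documented by NumPy as a convenience wrapper that is essentially a for loop, so it still calls the Python function once per pair. A different technique, not used in the answer above, is to write the Haversine formula directly with NumPy array operations and let broadcasting produce every pair at once. The following is a minimal sketch under stated assumptions: `haversine_matrix` and `EARTH_RADIUS_KM` are hypothetical names, the radius of 6371 km is an assumed mean value, and latitudes and longitudes are passed as two separate sequences rather than as tuples.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius in kilometers


def haversine_matrix(lats_deg, lngs_deg):
    """Return the (n, n) matrix of great-circle distances in kilometers."""
    lat = np.radians(np.asarray(lats_deg, dtype=float))
    lng = np.radians(np.asarray(lngs_deg, dtype=float))
    # Column vector minus row vector broadcasts to all pairwise differences.
    dlat = lat[:, None] - lat[None, :]
    dlng = lng[:, None] - lng[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlng / 2) ** 2)
    # Clip guards against rounding pushing a slightly outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Three points: the docstring example (Lyon, Paris) plus the first sample row.
lats = [45.7597, 48.8567, 16.3457688674]
lngs = [4.8422, 2.3508, 6.30354512503]
dist = haversine_matrix(lats, lngs)
```

With a DataFrame that stores tuples in a `coordinates` column, the two inputs could be obtained with `lats, lngs = zip(*df.coordinates)`. The result is a float matrix, so distances are not truncated to whole kilometers, and the n x n intermediate arrays are the main memory cost for long lists.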
