Pandas: calculate haversine distance within each group of rows


Question

The sample CSV looks like this:

 user_id  lat         lon
    1   19.111841   72.910729
    1   19.111342   72.908387
    2   19.111542   72.907387
    2   19.137815   72.914085
    2   19.119677   72.905081
    2   19.129677   72.905081
    3   19.319677   72.905081
    3   19.120217   72.907121
    4   19.420217   72.807121
    4   19.520217   73.307121
    5   19.319677   72.905081
    5   19.419677   72.805081
    5   19.629677   72.705081
    5   19.111860   72.911347
    5   19.111860   72.931346
    5   19.219677   72.605081
    6   19.319677   72.805082
    6   19.419677   72.905086

I know I can use the haversine formula to calculate distances (and Python also has a haversine package):

import math

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees).
    Source: http://gis.stackexchange.com/a/56589/15183
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a)) 
    km = 6371 * c
    return km

However, I only want to calculate distances between rows with the same id. So the expected result would look like this:

user_id  lat         lon    result
    1   19.111841   72.910729   NaN
    1   19.111342   72.908387   xx*
    2   19.111542   72.907387   NaN
    2   19.137815   72.914085   xx
    2   19.119677   72.905081   xx
    2   19.129677   72.905081   xx
    3   19.319677   72.905081   NaN
    3   19.120217   72.907121   xx
    4   19.420217   72.807121   NaN
    4   19.520217   73.307121   xx
    5   19.319677   72.905081   NaN
    5   19.419677   72.805081   xx
    5   19.629677   72.705081   xx
    5   19.111860   72.911347   xx
    5   19.111860   72.931346   xx
    5   19.219677   72.605081   xx
    6   19.319677   72.805082   NaN
    6   19.419677   72.905086   xx

*: xx is the distance in km.

How can I do this?

PS: I'm using pandas.

Recommended answer

Try this:

import pandas as pd
import numpy as np

# parse CSV to DataFrame. You may want to specify the separator (`sep='...'`)
df = pd.read_csv('/path/to/file.csv')

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.

    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))
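
A note on the read_csv step above: the sample in the question is separated by runs of spaces rather than commas, so the default separator would read each line as a single column. A minimal sketch, assuming the file looks like the sample (the inline `sample` string stands in for the real file and is only illustrative):

```python
import io

import pandas as pd

# First rows of the question's sample; columns are separated by runs of spaces.
sample = """user_id  lat         lon
1   19.111841   72.910729
1   19.111342   72.908387
2   19.111542   72.907387
"""

# sep=r"\s+" splits on any run of whitespace instead of the default comma.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")
print(df.dtypes)
```

With a real file, pass its path in place of the `io.StringIO(...)` object.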

Now we can calculate the distance between each row and the previous row of the same id (group). The grouping column is named id here; with the sample CSV from the question, group by user_id instead:

df['dist'] = \
    np.concatenate(df.groupby('id')
                     .apply(lambda x: haversine(x['lat'], x['lon'],
                                                x['lat'].shift(), x['lon'].shift())).values)
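
As an alternative sketch (not the answer's own method), the previous row of each group can be taken with a grouped shift, which keeps the original index. This should avoid relying on the rows already being sorted by group, which the concatenation above appears to assume. The column name user_id follows the question's sample, and the haversine here is a compact restatement of the function above so the block runs on its own:

```python
import io

import numpy as np
import pandas as pd

# First rows of the question's sample, with the question's column names.
sample = """user_id lat lon
1 19.111841 72.910729
1 19.111342 72.908387
2 19.111542 72.907387
2 19.137815 72.914085
2 19.119677 72.905081
2 19.129677 72.905081
"""
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")


def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Great-circle distance in km; all inputs in decimal degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))


# Previous row's coordinates within each group; the first row of a group gets NaN.
prev = df.groupby("user_id")[["lat", "lon"]].shift()

# NaN inputs propagate, so the first row of every group ends up with NaN distance.
df["dist"] = haversine(df["lat"], df["lon"], prev["lat"], prev["lon"])
print(df)
```

On these rows the distances should agree with the corresponding rows of the result table in this answer.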

Result:

In [105]: df
Out[105]:
    id        lat        lon       dist
0    1  19.111841  72.910729        NaN
1    1  19.111342  72.908387   0.252243
2    2  19.111542  72.907387        NaN
3    2  19.137815  72.914085   3.004976
4    2  19.119677  72.905081   2.227658
5    2  19.129677  72.905081   1.111949
6    3  19.319677  72.905081        NaN
7    3  19.120217  72.907121  22.179974
8    4  19.420217  72.807121        NaN
9    4  19.520217  73.307121  53.584504
10   5  19.319677  72.905081        NaN
11   5  19.419677  72.805081  15.286775
12   5  19.629677  72.705081  25.594890
13   5  19.111860  72.911347  61.509917
14   5  19.111860  72.931346   2.101215
15   5  19.219677  72.605081  36.304756
16   6  19.319677  72.805082        NaN
17   6  19.419677  72.905086  15.287063
