Efficiently compute pairwise haversine distances between two datasets - NumPy / Python


Question

I want to calculate the geographic distance between latitude-longitude pairs.

I checked the thread "Vectorizing Haversine distance calculation in Python", but when I use it for two different sets of coordinates, I get an error.

df1 can contain millions of rows, so any other way to calculate accurate geographic distances in less time would be really helpful.

import numpy as np
import pandas as pd

# Random (latitude, longitude) pairs in degrees
length1 = 1000
d1 = np.random.uniform(-90, 90, length1)
d2 = np.random.uniform(-180, 180, length1)
length2 = 100
d3 = np.random.uniform(-90, 90, length2)
d4 = np.random.uniform(-180, 180, length2)
coords = tuple(zip(d1, d2))
df1 = pd.DataFrame({'coordinates':coords})
coords = tuple(zip(d3, d4))
df2 = pd.DataFrame({'coordinates':coords})

def get_diff(df1, df2):
    data1 = np.array(df1['coordinates'].tolist())
    data2 = np.array(df2['coordinates'].tolist())
    lat1 = data1[:,0]                     
    lng1 = data1[:,1]
    lat2 = data2[:,0]                     
    lng2 = data2[:,1]
    #print(lat1.shape)
    #print(lng1.shape)
    #print(lat2.shape)
    #print(lng2.shape)
    diff_lat = lat1[:,None] - lat2

    diff_lng = lng1[:,None] - lng2
    #print(diff_lat.shape)
    #print(diff_lng.shape)
    d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat1) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

get_diff(df1, df2)

ValueError                                Traceback (most recent call last)
<ipython-input-58-df06c7cff72c> in <module>
----> 1 get_diff(df1, df2)

<ipython-input-57-9bd8f10189e6> in get_diff(df1, df2)
     26     print(diff_lat.shape)
     27     print(diff_lng.shape)
---> 28     d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat1) * np.sin(diff_lng/2)**2
     29     return 2 * 6371 * np.arcsin(np.sqrt(d))

ValueError: operands could not be broadcast together with shapes (1000,1000) (1000,100) 

Recommended answer

Pairwise haversine distances

This is based on this post:

def convert_to_arrays(df1, df2):
    d1 = np.array(df1['coordinates'].tolist())
    d2 = np.array(df2['coordinates'].tolist())
    return d1,d2

def broadcasting_based_lng_lat(data1, data2):
    # data1, data2 are the data arrays with 2 cols and they hold
    # lat., lng. values in those cols respectively
    data1 = np.deg2rad(data1)                     
    data2 = np.deg2rad(data2)                     

    lat1 = data1[:,0]                     
    lng1 = data1[:,1]         

    lat2 = data2[:,0]                     
    lng2 = data2[:,1]         

    diff_lat = lat1[:,None] - lat2
    diff_lng = lng1[:,None] - lng2
    d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))
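
The function above relies on NumPy broadcasting: a column of shape (m, 1) combined with a row of shape (n,) yields an (m, n) grid. The small sketch below uses made-up latitudes (not from the question) to show the shapes involved, and why multiplying by np.cos(lat1) instead of np.cos(lat2), as the question's code does, produces a mismatched (m, m) array. Note also that the question's code passes degrees straight to np.sin and np.cos, which expect radians; the answer converts with np.deg2rad first.

```python
import numpy as np

# Illustrative latitudes only (assumed values): m = 3 points and n = 2 points
lat1 = np.deg2rad(np.array([10.0, 20.0, 30.0]))
lat2 = np.deg2rad(np.array([40.0, 50.0]))

col = lat1[:, None]                       # shape (3, 1)
diff_lat = col - lat2                     # (3, 1) with (2,) broadcasts to (3, 2)
cos_term = np.cos(col) * np.cos(lat2)     # also (3, 2)

# Using lat1 on both sides, as in the question, gives (3, 3) instead,
# which cannot be added to the (3, 2) difference arrays.
wrong_term = np.cos(col) * np.cos(lat1)

print(diff_lat.shape, cos_term.shape, wrong_term.shape)
# → (3, 2) (3, 2) (3, 3)
```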

Hence, to get all pairwise haversine distances for your case, call:

broadcasting_based_lng_lat(*convert_to_arrays(df1,df2))
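
Since df1 may hold millions of rows, the full (m, n) result can become large: with float64 values, 1,000,000 x 100 distances take roughly 800 MB, and the intermediate arrays add to that. One possible way to bound memory, sketched here as an assumption rather than as part of the original answer, is to process the first dataset in chunks and reduce each block before moving on. The names pairwise_haversine, pairwise_haversine_chunked, and chunk_size are hypothetical, and the nearest-point reduction is only an example of a per-block computation.

```python
import numpy as np

def pairwise_haversine(data1, data2, radius_km=6371.0):
    # data1: (m, 2), data2: (n, 2); columns are latitude, longitude in degrees
    data1 = np.deg2rad(np.asarray(data1, dtype=float))
    data2 = np.deg2rad(np.asarray(data2, dtype=float))
    lat1, lng1 = data1[:, 0], data1[:, 1]
    lat2, lng2 = data2[:, 0], data2[:, 1]
    diff_lat = lat1[:, None] - lat2
    diff_lng = lng1[:, None] - lng2
    d = (np.sin(diff_lat / 2) ** 2
         + np.cos(lat1[:, None]) * np.cos(lat2) * np.sin(diff_lng / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(d))

def pairwise_haversine_chunked(data1, data2, chunk_size=100_000):
    # Hypothetical helper: yields (start_row, block), where block has shape
    # (rows_in_chunk, n), so only one block is held in memory at a time.
    data1 = np.asarray(data1, dtype=float)
    for start in range(0, len(data1), chunk_size):
        yield start, pairwise_haversine(data1[start:start + chunk_size], data2)

# Synthetic data with the same sizes as in the question
rng = np.random.default_rng(0)
a = np.column_stack([rng.uniform(-90, 90, 1000), rng.uniform(-180, 180, 1000)])
b = np.column_stack([rng.uniform(-90, 90, 100), rng.uniform(-180, 180, 100)])

# Example reduction: index of the nearest point in b for every point in a
nearest = np.empty(len(a), dtype=int)
for start, block in pairwise_haversine_chunked(a, b, chunk_size=256):
    nearest[start:start + len(block)] = block.argmin(axis=1)

print(nearest.shape)
```

If the full matrix is genuinely needed and fits in memory, the single broadcasting call is simpler; chunking mainly helps when each block can be reduced or written out immediately.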


Elementwise haversine distances

For element-wise haversine distances between two datasets of equal length, where each dataset holds latitude and longitude either in two columns or as two-element lists, we skip the extension to 2D and end up with something like this:

def broadcasting_based_lng_lat_elementwise(data1, data2):
    # data1, data2 are the data arrays with 2 cols and they hold
    # lat., lng. values in those cols respectively
    data1 = np.deg2rad(data1)                     
    data2 = np.deg2rad(data2)                     

    lat1 = data1[:,0]                     
    lng1 = data1[:,1]         

    lat2 = data2[:,0]                     
    lng2 = data2[:,1]         

    diff_lat = lat1 - lat2
    diff_lng = lng1 - lng2
    d = np.sin(diff_lat/2)**2 + np.cos(lat1)*np.cos(lat2) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

Sample run with a dataframe holding the two datasets in two columns:

In [42]: np.random.seed(0)
    ...: a = np.random.randint(10,100,(5,2)).tolist()
    ...: b = np.random.randint(10,100,(5,2)).tolist()
    ...: df = pd.DataFrame({'A':a,'B':b})

In [43]: df
Out[43]: 
          A         B
0  [54, 57]  [80, 98]
1  [74, 77]  [98, 22]
2  [77, 19]  [68, 75]
3  [93, 31]  [49, 97]
4  [46, 97]  [56, 98]

In [44]: from haversine import haversine

In [45]: [haversine(i,j) for (i,j) in zip(df.A,df.B)]
Out[45]: 
[3235.9659882513424,
 2399.6124657290075,
 2012.0851666001824,
 4702.8069773315865,
 1114.1193334220534]

In [46]: broadcasting_based_lng_lat_elementwise(np.vstack(df.A), np.vstack(df.B))
Out[46]: 
array([3235.96151855, 2399.60915125, 2012.08238739, 4702.80048155,
       1114.11779454])

Those slight differences arise largely because the haversine library assumes an Earth radius of 6371.0088 km, while we use 6371 km here.
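
Because the radius enters the formula only as a final multiplier, changing it rescales every distance by the same factor. The sketch below makes the radius a parameter (radius_km is a hypothetical argument name, not part of the original function) and checks that switching from 6371 to 6371.0088 changes each result by exactly that ratio. The two coordinate pairs are rows 0 and 4 of the sample dataframe.

```python
import numpy as np

def haversine_elementwise(data1, data2, radius_km=6371.0):
    # Same element-wise formula as above, with the Earth radius exposed
    data1 = np.deg2rad(np.asarray(data1, dtype=float))
    data2 = np.deg2rad(np.asarray(data2, dtype=float))
    lat1, lng1 = data1[:, 0], data1[:, 1]
    lat2, lng2 = data2[:, 0], data2[:, 1]
    d = (np.sin((lat1 - lat2) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng1 - lng2) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(d))

a = np.array([[54, 57], [46, 97]])
b = np.array([[80, 98], [56, 98]])

d_6371 = haversine_elementwise(a, b)                       # radius used in this answer
d_mean = haversine_elementwise(a, b, radius_km=6371.0088)  # radius cited for the haversine library

# Every distance scales by the same factor, 6371.0088 / 6371
print(d_mean / d_6371)
```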
