Python - inefficient spatial distance calculation (how can it be sped up?)


Question

I am currently trying some geocoding in Python. The process is the following: I have two data frames (df1 and df2, houses and schools) with latitude and longitude values, and I want to find the nearest neighbour in df2 for every observation in df1. I use the following code:

from tqdm import tqdm
import numpy as np
import pandas as pd
import math 

def distance(lat1, long1, lat2, long2):
        R = 6371 # Earth Radius in Km
        dLat = math.radians(lat2 - lat1) # Convert Degrees 2 Radians 
        dLong = math.radians(long2 - long1)
        lat1 = math.radians(lat1)
        lat2 = math.radians(lat2)
        a = math.sin(dLat/2) * math.sin(dLat/2) + math.sin(dLong/2) * math.sin(dLong/2) * math.cos(lat1) * math.cos(lat2)
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
        d = R * c
        return d

dists = []
schools =[]
for index, row1 in tqdm(df1.iterrows()):
    for index, row2 in df2.iterrows():
        dists.append(distance(row1.lat, row1.lng, row2.Latitude, row2.Longitude))
    schools.append(min(dists))
    del dists[:]

df1["school"] = pd.Series(schools)

The code works, however it takes ages. With tqdm I get an average speed of 2 iterations of df1 per second. As a comparison, I did the whole task in STATA with geonear, and it takes 1 second for all 950 observations in df1. I read in the help file of geonear that it uses clustering so as not to calculate all distances, only the closest. However, before I add a clustering function (which might also take CPU power), I wonder if someone sees a way to speed up the process as it is (I am new to Python and might have some inefficient code that slows the process down). Or is there a package that does the process faster?

I would be ok if it takes longer than in STATA, but not nearly 7 minutes...

Thanks in advance

Answer

The way you're doing this is slow because you are using an O(n²) algorithm: each row looks at every other row. Georgy's answer, while introducing vectorization, does not solve this fundamental inefficiency.
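To make the vectorization point concrete, here is a sketch (not Georgy's actual code) of a broadcast haversine in NumPy: it moves the per-pair trigonometry out of Python loops into compiled array operations, which is much faster per pair, but it still computes the full n×m distance matrix, so the total work remains quadratic.

```python
import numpy as np

def haversine_matrix(lat1, lon1, lat2, lon2):
    """All-pairs great-circle distances in km via NumPy broadcasting.

    lat1/lon1 have shape (n,), lat2/lon2 shape (m,); returns shape (n, m).
    """
    R = 6371.0  # Earth radius in km, as in the question
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2[np.newaxis, :] - lat1[:, np.newaxis]
    dlon = lon2[np.newaxis, :] - lon1[:, np.newaxis]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1)[:, np.newaxis] * np.cos(lat2)[np.newaxis, :]
           * np.sin(dlon / 2) ** 2)
    # 2*arcsin(sqrt(a)) is equivalent to the question's atan2 formulation
    return 2 * R * np.arcsin(np.sqrt(a))
```

With the question's frames the nearest distances would then be `haversine_matrix(df1.lat.to_numpy(), df1.lng.to_numpy(), df2.Latitude.to_numpy(), df2.Longitude.to_numpy()).min(axis=1)` — fast enough in practice for ~950×950 points, but the cost still grows as O(n·m).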

I'd recommend loading your data points into a kd-tree: this data structure provides a fast way of finding nearest neighbours in multiple dimensions. Construction of such a tree is in O(n log n) and a query takes O(log n), so total time is in O(n log n).

If your data is localized to a geographic region that can be well approximated by a plane, project your data and then perform the lookup in two dimensions. Otherwise, if your data is globally dispersed, project into spherical Cartesian coordinates and perform the look-up there.
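For the planar case, a minimal sketch (the coordinates below are hypothetical, and `cKDTree` is SciPy's faster C implementation of the kd-tree): project with an equirectangular approximation around a reference latitude, then query the tree in 2D.

```python
import numpy as np
from scipy.spatial import cKDTree

R_EARTH = 6371.0  # km

def project_planar(lon, lat, lat0):
    """Equirectangular projection onto a plane around latitude lat0 (degrees).

    Adequate when all points span a small region; coordinates come out in km.
    """
    x = R_EARTH * np.radians(lon) * np.cos(np.radians(lat0))
    y = R_EARTH * np.radians(lat)
    return np.column_stack((x, y))

# Hypothetical small-region data: two houses and two schools around 52 N
lat0    = 52.0
houses  = project_planar(np.array([13.30, 13.45]), np.array([52.48, 52.52]), lat0)
schools = project_planar(np.array([13.40, 13.35]), np.array([52.50, 52.49]), lat0)

tree = cKDTree(schools)            # build once: O(m log m)
dist_km, idx = tree.query(houses)  # one O(log m) query per house
```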

An example of how you might do this appears as follows:

#!/usr/bin/env python3

import numpy as np
import scipy as sp
import scipy.spatial

Rearth = 6371

#Generate uniformly-distributed lon-lat points on a sphere
#See: http://mathworld.wolfram.com/SpherePointPicking.html
def GenerateUniformSpherical(num):
  #Generate random variates
  pts      = np.random.uniform(low=0, high=1, size=(num,2))
  #Convert to sphere space
  pts[:,0] = 2*np.pi*pts[:,0]          #0-360 degrees
  pts[:,1] = np.arccos(2*pts[:,1]-1)   #0-180 degrees
  #Convert to degrees
  pts = np.degrees(pts)
  #Shift ranges to lon-lat
  pts[:,0] -= 180
  pts[:,1] -= 90
  return pts

def ConvertToXYZ(lonlat):
  theta  = np.radians(lonlat[:,0])+np.pi
  phi    = np.radians(lonlat[:,1])+np.pi/2
  x      = Rearth*np.cos(theta)*np.sin(phi)
  y      = Rearth*np.sin(theta)*np.sin(phi)
  z      = Rearth*np.cos(phi)
  return np.transpose(np.vstack((x,y,z)))

#For each entry in qpts, find the nearest point in the kdtree
def GetNearestNeighbours(qpts,kdtree):
  pts3d        = ConvertToXYZ(qpts)
  #See: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
  #p=2 implies Euclidean distance, eps=0 implies no approximation (slower)
  return kdtree.query(pts3d,p=2,eps=0) 

#Generate uniformly-distributed test points on a sphere. Note that you'll want
#to find a way to extract your pandas columns into an array of width=2, height=N
#to match this format.
df1 = GenerateUniformSpherical(10000)
df2 = GenerateUniformSpherical(10000)

#Convert df2 into XYZ coordinates. WARNING! Do not alter df2_3d or kdtree will
#malfunction!
df2_3d = ConvertToXYZ(df2)
#Build a kd-tree from df2_3D
kdtree = sp.spatial.KDTree(df2_3d, leafsize=10) #Stick points in kd-tree for fast look-up

#Return the distance to, and index of, each of df1's nearest neighbour points
distance, indices = GetNearestNeighbours(df1,kdtree)
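To connect this back to the question's data frames, a sketch (assuming the column names `lat`/`lng` and `Latitude`/`Longitude` from the question, and a simpler but equivalent lon/lat-to-XYZ conversion): note the kd-tree returns straight-line chord distances through the sphere, which can be converted back to great-circle kilometres at the end.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

R = 6371.0  # km

# Hypothetical frames matching the question's column names
df1 = pd.DataFrame({"lat": [52.48, 52.52], "lng": [13.30, 13.45]})
df2 = pd.DataFrame({"Latitude": [52.50, 52.49], "Longitude": [13.40, 13.35]})

def to_xyz(lon, lat):
    """Convert lon/lat in degrees to 3D Cartesian points on the sphere."""
    lon, lat = np.radians(lon), np.radians(lat)
    return np.column_stack((R * np.cos(lat) * np.cos(lon),
                            R * np.cos(lat) * np.sin(lon),
                            R * np.sin(lat)))

tree = cKDTree(to_xyz(df2.Longitude.to_numpy(), df2.Latitude.to_numpy()))
chord_km, idx = tree.query(to_xyz(df1.lng.to_numpy(), df1.lat.to_numpy()))

df1["school"] = df2.index[idx]  # index of the nearest school in df2
# chord = 2R*sin(theta/2)  =>  great-circle arc = 2R*arcsin(chord/(2R))
df1["school_dist_km"] = 2 * R * np.arcsin(chord_km / (2 * R))
```

Chord and great-circle distance increase together, so querying on chord distance always finds the same nearest neighbour as the haversine formula would.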
