查找数据集中所有点的距离最近点 - Python [英] Find the nearest point in distance for all the points in the dataset - Python
问题描述
我有一个数据集如下,
Id Latitude longitude
1 25.42 55.47
2 25.39 55.47
3 24.48 54.38
4 24.51 54.54
我想为数据集的每个点找到最近的距离.我在网上找到了如下距离函数,
I want to find the nearest distance for every point for the dataset. I found the following distance function in the internet,
from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
return km
我正在使用以下功能,
shortest_distance = []
for i in range(1,len(data)):
distance1 = []
for j in range(1,len(data)):
distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j]))
shortest_distance.append(min(distance1))
但是这段代码为每个条目循环两次并返回 n^2 次迭代,反过来它非常慢.我的数据集包含近 100 万条记录,每次遍历所有元素两次都变得非常昂贵.
But this code is looping twice for each entry and return n^2 iterations and in turn it is very slow. My dataset contains nearly 1 million records and each time looping through all the elements twice is becoming very costly.
我想找到更好的方法来找出每一行的最近点.任何人都可以帮我找到一种在 python 中解决这个问题的方法吗?
I want to find the better way to find out the nearest point for each row. Can anybody help me in finding a way to solve this in python ?
谢谢
推荐答案
寻找 N
个点中最近的一个给定点的蛮力方法是 O(N)
-- 你必须检查每个点.相反,如果 N
点存储在 KD-tree,然后找到最近的点平均是O(log(N))
.构建 KD 树还有额外的一次性成本,这需要 O(N)
时间.
The brute force method of finding the nearest of N
points to a given point is O(N)
-- you'd have to check each point.
In contrast, if the N
points are stored in a KD-tree, then finding the nearest point is on average O(log(N))
.
There is also the additional one-time cost of building the KD-tree, which requires O(N)
time.
如果需要重复这个过程N
次,那么蛮力方法是O(N**2)
,kd-tree方法是O(N*log(N))
.因此,对于足够大的 N
,KD 树将击败蛮力方法.
If you need to repeat this process N
times, then the brute force method is O(N**2)
and the kd-tree method is O(N*log(N))
.
Thus, for large enough N
, the KD-tree will beat the brute force method.
请参阅此处了解有关最近邻算法的更多信息(包括 KD 树).
See here for more on nearest neighbor algorithms (including KD-tree).
下面(在函数 using_kdtree
中)是一种使用 scipy.spatial.kdtree
.
Below (in the function using_kdtree
) is a way to compute the great circle arclengths of nearest neighbors using scipy.spatial.kdtree
.
scipy.spatial.kdtree
使用点之间的欧几里得距离,但有一个 公式 用于将球体上点之间的欧几里得弦距转换为大圆弧长(给定球体的半径).所以想法是将纬度/经度数据转换成笛卡尔坐标,使用KDTree
来寻找最近的邻居,然后应用大圆距离公式 获得想要的结果.
scipy.spatial.kdtree
uses the Euclidean distance between points, but there is a formula for converting Euclidean chord distances between points on a sphere to great circle arclength (given the radius of the sphere).
So the idea is to convert the latitude/longitude data into cartesian coordinates, use a KDTree
to find the nearest neighbors, and then apply the great circle distance formula to obtain the desired result.
以下是一些基准.使用 N = 100
,using_kdtree
比 orig
(蛮力)方法快 39 倍.
Here are some benchmarks. Using N = 100
, using_kdtree
is 39x faster than the orig
(brute force) method.
In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop
In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop
In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop
对于 N = 10000
:
In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop
In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop
In [7]: %timeit orig(data)
# untested; too slow
因为 using_kdtree
是 O(N log(N))
而 orig
是 O(N**2)代码>,因子为
using_kdtree
比 orig
快的哪个会随着 N
增长,长度为数据
,增长.
Since using_kdtree
is O(N log(N))
and orig
is O(N**2)
, the factor by
which using_kdtree
is faster than orig
will grow as N
, the length of
data
, grows.
import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt
R = 6367
def using_kdtree(data):
"Based on https://stackoverflow.com/q/43020919/190597"
def dist_to_arclength(chord_length):
"""
https://en.wikipedia.org/wiki/Great-circle_distance
Convert Euclidean chord length to great circle arc length
"""
central_angle = 2*np.arcsin(chord_length/(2.0*R))
arclength = R*central_angle
return arclength
phi = np.deg2rad(data['Latitude'])
theta = np.deg2rad(data['Longitude'])
data['x'] = R * np.cos(phi) * np.cos(theta)
data['y'] = R * np.cos(phi) * np.sin(theta)
data['z'] = R * np.sin(phi)
tree = spatial.KDTree(data[['x', 'y','z']])
distance, index = tree.query(data[['x', 'y','z']], k=2)
return dist_to_arclength(distance[:, 1])
def orig(data):
def distance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
c = 2 * asin(sqrt(a))
km = R * c
return km
shortest_distance = []
for i in range(len(data)):
distance1 = []
for j in range(len(data)):
if i == j: continue
distance1.append(distance(data['Longitude'][i], data['Latitude'][i],
data['Longitude'][j], data['Latitude'][j]))
shortest_distance.append(min(distance1))
return shortest_distance
def using_sklearn(data):
"""
Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
"""
def distance(p1, p2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
lon1, lat1 = p1
lon2, lat2 = p2
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
km = R * c
return km
points = data[['Longitude', 'Latitude']]
nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
distances, indices = nbrs.kneighbors(points)
result = distances[:, 1]
return result
np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N),
'Longitude':np.random.uniform(0,360,size=N)})
expected = orig(data)
for func in [using_kdtree, using_sklearn]:
result = func(data)
assert np.allclose(expected, result)
这篇关于查找数据集中所有点的距离最近点 - Python的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!