Python calculate lots of distances quickly


Question

I have an input of 36,742 points, which means that if I wanted to calculate the lower triangle of a distance matrix (using the Vincenty approximation) I would need to generate 36,742 × 36,741 × 0.5 = 674,968,911 distances.

I want to keep only the pair combinations that are within 50 km of each other. My current set-up is as follows:

import time
from geopy.distance import vincenty  # geopy < 2.0; newer versions use geodesic

shops = [[id, lat, lon], ...]  # one row per shop

def lower_triangle_mat(points):
    # yield each unordered pair exactly once
    for i in range(len(points) - 1):
        for j in range(i + 1, len(points)):
            yield (points[i], points[j])

def return_stores_cutoff(points, cutoff_km=0):
    below_cut = []
    counter = 0
    total = len(points) * (len(points) - 1) * 0.5
    for a, b in lower_triangle_mat(points):
        dist_km = vincenty(a[1:3], b[1:3]).km
        counter += 1
        if counter % 1000000 == 0:
            print("%d out of %d" % (counter, total))
        if dist_km <= cutoff_km:
            below_cut.append([a[0], b[0], dist_km])
    return below_cut

start = time.clock()  # time.perf_counter() on Python >= 3.8
stores = return_stores_cutoff(points=shops, cutoff_km=50)
print(time.clock() - start)

This will obviously take hours and hours. Some possibilities I was thinking of:

  • Use numpy to vectorise these calculations rather than looping through them
  • Use some kind of hashing to get a quick rough cut-off (all stores within 100 km) and then only calculate accurate distances between those stores
  • Instead of storing the points in a list, use something like a quad-tree - but I think that only helps with the ranking of close points rather than actual distance -> so I guess some kind of geodatabase
  • I could obviously try the haversine formula, or project the points and use Euclidean distances; however, I am interested in using the most accurate measure possible
  • Make use of parallel processing (however, I was having a bit of difficulty working out how to split the list so as to still get all the relevant pairs).
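The first idea in the list (a vectorised rough cut) can be sketched with a broadcast haversine - an approximation rather than Vincenty, which is fine for a loose pre-filter. The coordinates below are illustrative:

```python
import numpy as np

def haversine_km(lat, lon):
    """All-pairs great-circle distances (km) for arrays of lat/lon in degrees,
    assuming a spherical Earth of mean radius 6371 km."""
    R = 6371.0
    phi = np.radians(lat)
    lam = np.radians(lon)
    dphi = phi[:, None] - phi[None, :]
    dlam = lam[:, None] - lam[None, :]
    a = (np.sin(dphi / 2) ** 2
         + np.cos(phi)[:, None] * np.cos(phi)[None, :] * np.sin(dlam / 2) ** 2)
    return 2 * R * np.arcsin(np.sqrt(a))

# three sample shops: London, Paris, and a point a few metres from London
lat = np.array([51.5074, 48.8566, 51.5072])
lon = np.array([-0.1278, 2.3522, -0.1276])
d = haversine_km(lat, lon)

# keep only strictly-lower-triangle pairs within a loose 100 km pre-cutoff
i, j = np.tril_indices(len(lat), k=-1)
close = [(a, b, d[a, b]) for a, b in zip(i, j) if d[a, b] <= 100]
```

For 36,742 points the full matrix would not fit in memory, so in practice this would be run over chunks of rows; survivors of the rough cut can then get the exact Vincenty distance.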

Edit: I think geohashing is definitely needed here - an example using the geoindex package:

import random
from geoindex import GeoGridIndex, GeoPoint

geo_index = GeoGridIndex()
for _ in range(10000):
    lat = random.random() * 180 - 90
    lng = random.random() * 360 - 180
    geo_index.add_point(GeoPoint(lat, lng))

center_point = GeoPoint(37.7772448, -122.3955118)
for distance, point in geo_index.get_nearest_points(center_point, 10, 'km'):
    print("We found {0} in {1} km".format(point, distance))

However, I would also like to vectorise (instead of loop) the distance calculations for the stores returned by the geo-hash.

Edit 2: Pouria Hadjibagheri - I tried using lambda and map:

# [B]: Mapping approach
def lower_pairs(shops):
    return ((shops[i], shops[j]) for i in range(len(shops) - 1)
            for j in range(i + 1, len(shops)))

func = lambda x: (x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km)
# Trying to see if conditional statements slow this down
# (note: this calls vincenty twice per pair)
func_cond = lambda x: (x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km) \
    if vincenty(x[0][1:3], x[1][1:3]).km <= 50 else None

start = time.clock()
out_dist = list(map(func, lower_pairs(shops)))
print(time.clock() - start)

start = time.clock()
# rebuild the generator - the first map exhausted it
out_dist = list(map(func_cond, lower_pairs(shops)))
print(time.clock() - start)

They both took around 61 seconds (I restricted the number of stores to 2,000 from 32,000). Perhaps I used map incorrectly?
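One thing worth noting: map carries the same per-pair Python-level overhead as the explicit loop, and func_cond additionally evaluates vincenty twice per pair. A version that computes each distance exactly once might look like the sketch below (the dist argument is a placeholder for any callable such as lambda a, b: vincenty(a, b).km; the demo metric is a cheap stand-in, not a real distance):

```python
def pair_distances(shops, dist, cutoff_km=50):
    """Lower-triangle pass that evaluates the distance function once per pair.

    shops: sequence of (id, lat, lon) rows.
    dist:  any callable taking two (lat, lon) tuples and returning km.
    """
    out = []
    for i in range(len(shops) - 1):
        for j in range(i + 1, len(shops)):
            d = dist(shops[i][1:3], shops[j][1:3])  # computed exactly once
            if d <= cutoff_km:
                out.append((shops[i][0], shops[j][0], d))
    return out

# demo with a cheap stand-in metric (|delta lat| * 111 km per degree)
shops = [(1, 0.0, 0.0), (2, 0.1, 0.0), (3, 10.0, 0.0)]
close = pair_distances(shops, lambda a, b: abs(a[0] - b[0]) * 111.0, cutoff_km=50)
```

Even so, the dominant cost is the ~675 million Vincenty evaluations, which is why the spatial-index answer below the fold changes the complexity rather than the constant factor.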

Answer

This sounds like a classic use case for k-D trees.

If you first transform your points into Euclidean space then you can use the query_pairs method of scipy.spatial.cKDTree:

from scipy.spatial import cKDTree

tree = cKDTree(data)
# where data is (nshops, ndim) containing the Euclidean coordinates of each shop
# in units of km

pairs = tree.query_pairs(50, p=2)   # 50km radius, L2 (Euclidean) norm

pairs will be a set of (i, j) tuples corresponding to the row indices of pairs of shops that are ≤50km from each other.
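For the "transform into Euclidean space" step, one option is a spherical ECEF (Earth-centred, Earth-fixed) transform - a sketch assuming a sphere of radius 6371 km, with illustrative coordinates:

```python
import numpy as np

def to_ecef_km(lat_deg, lon_deg):
    """3-D Cartesian coordinates in km, assuming a spherical Earth (R = 6371 km)."""
    R = 6371.0
    phi = np.radians(lat_deg)
    lam = np.radians(lon_deg)
    return np.column_stack([R * np.cos(phi) * np.cos(lam),
                            R * np.cos(phi) * np.sin(lam),
                            R * np.sin(phi)])

# e.g. London and Paris
coords = to_ecef_km(np.array([51.5074, 48.8566]), np.array([-0.1278, 2.3522]))
# coords can be passed straight to cKDTree(coords). Note query_pairs then
# measures chord distance, which slightly underestimates great-circle
# distance, so pad the radius a little (e.g. 51 km for a 50 km cutoff).
```

Candidate pairs from the padded chord search can then be confirmed with an exact Vincenty calculation, keeping the accurate measure while avoiding the all-pairs cost.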

The output of tree.sparse_distance_matrix is a scipy.sparse.dok_matrix. Since the matrix will be symmetric and you're only interested in unique row/column pairs, you could use scipy.sparse.tril to zero out the upper triangle, giving you a scipy.sparse.coo_matrix. From there you can access the nonzero row and column indices and their corresponding distance values via the .row, .col and .data attributes:

from scipy import sparse

tree_dist = tree.sparse_distance_matrix(tree, max_distance=10000, p=2)
udist = sparse.tril(tree_dist, k=-1)    # keep strictly lower triangle (k=-1 drops the diagonal too)
ridx = udist.row    # row indices
cidx = udist.col    # column indices
dist = udist.data   # distance values
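Mapping those parallel arrays back to the original shop ids is then a simple zip. A self-contained sketch, with ridx/cidx/dist shown as plain arrays standing in for the sparse output above and made-up shop rows:

```python
import numpy as np

shops = [[101, 51.50, -0.13], [102, 51.60, -0.12], [103, 48.86, 2.35]]
# stand-ins for udist.row / udist.col / udist.data from the snippet above
ridx = np.array([1])
cidx = np.array([0])
dist = np.array([11.2])

# same (id, id, km) layout as below_cut in the question
below_cut = [(shops[i][0], shops[j][0], d) for i, j, d in zip(ridx, cidx, dist)]
```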
