Vectorization to calculate many distances


Question

I am new to numpy/pandas and vectorized computation. I am working on a data task with two datasets. Dataset 1 contains a list of places with their longitude and latitude and a variable A. Dataset 2 also contains a list of places with their longitude and latitude. For each place in dataset 1, I would like to calculate its distance to every place in dataset 2, but I only need a count of the places in dataset 2 whose distance is less than the value of variable A. Both datasets are very large, so I need vectorized operations to speed up the computation.

For example, dataset 1 might look like this:

id lon    lat   varA
1  20.11 19.88  100
2  20.87 18.65  90
3  18.99 20.75  120

and dataset 2 might look like this:

placeid lon lat 
a       18.75 20.77
b       19.77 22.56
c       20.86 23.76
d       17.55 20.74 

Then for id == 1 in dataset 1, I would like to calculate its distances to all four points (a, b, c, d) in dataset 2 and count how many of those distances are less than the corresponding value of varA. For example, if the four calculated distances are 90, 70, 120 and 110 and varA is 100, the value should be 2.

I already have a vectorized function to calculate the distance between two pairs of coordinates. Assuming that function (haversine) is properly implemented, I have the following code.

dataset2['count'] = dataset1.apply(
    lambda x: haversine(x['lon'], x['lat'], dataset2['lon'], dataset2['lat']).shape[0],
    axis=1)

However, this gives the total number of rows, not the number of rows that satisfy my requirement.

Would anyone be able to point out how to make the code work?

Recommended answer

If you can project the coordinates to a local projection (e.g. UTM), which is pretty straightforward with pyproj and generally more favorable than lon/lat for measurement, then there is a much, much faster way using scipy.spatial. Neither df['something'] = df.apply(...) nor np.vectorize() is truly vectorized; under the hood, they use looping.

ds1
    id  lon lat varA
0   1   20.11   19.88   100
1   2   20.87   18.65   90
2   3   18.99   20.75   120

ds2
    placeid lon lat
0   a   18.75   20.77
1   b   19.77   22.56
2   c   20.86   23.76
3   d   17.55   20.74


import numpy as np
from scipy.spatial import distance

# get the coordinates of each set of points as a float numpy array
coords_a = ds1[['lon', 'lat']].to_numpy()
coords_b = ds2[['lon', 'lat']].to_numpy()
coords_a
#out: array([[ 20.11,  19.88],
#       [ 20.87,  18.65],
#       [ 18.99,  20.75]])

distances = distance.cdist(coords_a, coords_b)
#out: array([[ 1.62533074,  2.70148108,  3.95182236,  2.70059253],
#       [ 2.99813275,  4.06178532,  5.11000978,  3.92307278],
#       [ 0.24083189,  1.97091349,  3.54358575,  1.44003472]])
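To make the shape of that result less mysterious, here is a minimal sketch that reproduces the same pairwise matrix with plain NumPy broadcasting. It is an illustration only, not how cdist is implemented internally; the name pairwise is an assumption, and the coordinates are copied by hand from the ds1 and ds2 tables.

```python
import numpy as np

# Coordinates copied from the ds1 / ds2 tables (lon, lat columns).
coords_a = np.array([[20.11, 19.88], [20.87, 18.65], [18.99, 20.75]])
coords_b = np.array([[18.75, 20.77], [19.77, 22.56],
                     [20.86, 23.76], [17.55, 20.74]])

# Add length-1 axes so (3, 1, 2) - (1, 4, 2) broadcasts to (3, 4, 2):
# one coordinate difference for every (row of a, row of b) pair.
diff = coords_a[:, None, :] - coords_b[None, :, :]

# The Euclidean norm over the last axis collapses (3, 4, 2) to (3, 4).
pairwise = np.sqrt((diff ** 2).sum(axis=-1))

print(pairwise.shape)
print(pairwise.round(4))
```

Row i of pairwise holds the distances from point i of the first set to every point of the second set, which is the layout the thresholding steps rely on.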

distances is in fact the distance between every pair of points. coords_a.shape is (3, 2) and coords_b.shape is (4, 2), so the result has shape (3, 4). The default metric for distance.cdist is Euclidean, but other metrics are available as well. For the sake of this example, let's assume vara is:

vara = np.array([2,4.5,2])

(instead of 100 90 120). We need to identify which values of distances in row one are smaller than 2, which in row two are smaller than 4.5, and so on. One way to solve this is to subtract each value in vara from the corresponding row (note that vara must first be reshaped into a column so that it lines up with the rows):

vara = vara.reshape(3, 1)  # column vector, so each row gets its own threshold
res = distances - vara
#out: array([[-0.37466926,  0.70148108,  1.95182236,  0.70059253],
#       [-1.50186725, -0.43821468,  0.61000978, -0.57692722],
#       [-1.75916811, -0.02908651,  1.54358575, -0.55996528]])

Then setting the positive values to zero and making the negative values positive gives us the final array:

res[res>0] = 0
res = np.absolute(res)
#out: array([[ 0.37466926,  0.        ,  0.        ,  0.        ],
#            [ 1.50186725,  0.43821468,  0.        ,  0.57692722],
#            [ 1.75916811,  0.02908651,  0.        ,  0.55996528]])

Now, to sum each row:

sum_ = res.sum(axis=1)
#out:  array([ 0.37466926,  2.51700915,  2.34821989])

and to count the items in each row:

count = np.count_nonzero(res, axis=1)
#out: array([1, 3, 3])
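If only the count is needed, the subtract-and-zero steps can be skipped. The sketch below compares the distance matrix against the thresholds directly and sums the resulting boolean mask. The distances are copied from the cdist output, and within is an illustrative name that does not appear in the original answer.

```python
import numpy as np

# Distance matrix from the cdist output (rows: ds1 points, columns: ds2 points).
distances = np.array([
    [1.62533074, 2.70148108, 3.95182236, 2.70059253],
    [2.99813275, 4.06178532, 5.11000978, 3.92307278],
    [0.24083189, 1.97091349, 3.54358575, 1.44003472],
])
vara = np.array([2, 4.5, 2])

# vara[:, None] has shape (3, 1), so every row is compared
# against its own threshold through broadcasting.
within = distances < vara[:, None]

# True counts as 1, so a row-wise sum is the number of ds2 points
# that lie closer than the threshold for that ds1 point.
count = within.sum(axis=1)
print(count)  # [1 3 3]
```

The boolean mask uses one byte per pair rather than eight, which may matter when both datasets are large.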

This is a fully vectorized (custom) solution which you can tweak to your liking and which should accommodate any level of complexity. Yet another solution is cKDTree. The code below is from the documentation. It should be fairly easy to adapt to your problem, but in case you need assistance don't hesitate to ask.

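import numpy as np
from scipy import spatial

x, y = np.mgrid[0:4, 0:4]
# cKDTree needs an (n, 2) array; in Python 3 zip() returns an iterator, so stack the columns instead
points = np.c_[x.ravel(), y.ravel()]
tree = spatial.cKDTree(points)
sorted(tree.query_ball_point([2, 0], 1))
# out: [4, 8, 9, 12]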

query_ball_point() finds all points within distance r of the point(s) x, and it is amazingly fast.
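As a rough sketch of how the tree could be adapted to the counting problem: build the tree on the second set of points, then query it once with all points of the first set and one radius per point. This assumes a SciPy version recent enough for query_ball_point to accept the return_length argument, and it reuses the example coordinates and the illustrative vara thresholds. Note that the query includes points at distance exactly r, whereas the question asks for strictly smaller distances.

```python
import numpy as np
from scipy import spatial

# Example coordinates from the ds1 / ds2 tables (treated as planar here).
coords_a = np.array([[20.11, 19.88], [20.87, 18.65], [18.99, 20.75]])
coords_b = np.array([[18.75, 20.77], [19.77, 22.56],
                     [20.86, 23.76], [17.55, 20.74]])
vara = np.array([2, 4.5, 2])  # one search radius per point in coords_a

# Index the points that are being counted.
tree = spatial.cKDTree(coords_b)

# r is broadcast against the query points; return_length=True returns
# the number of neighbours instead of the lists of their indices.
counts = tree.query_ball_point(coords_a, r=vara, return_length=True)
print(counts)  # [1 3 3]
```

Unlike the cdist approach, this never materialises the full distance matrix, which is the main reason to prefer it for very large inputs.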

One final note: don't use these algorithms with lon/lat input, particularly if your area of interest is far from the equator, because the error can get huge.
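Where projecting is inconvenient, one alternative that stays in lon/lat is to broadcast the haversine formula over both datasets. The following is a minimal sketch, not the asker's own haversine function: haversine_matrix and EARTH_RADIUS_KM are hypothetical names, the Earth is treated as a sphere of radius 6371 km, and varA is assumed to be expressed in kilometres.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical model)

def haversine_matrix(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between every point of set 1 and set 2."""
    lon1, lat1, lon2, lat2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lon1, lat1, lon2, lat2)
    )
    # Column vector minus row vector broadcasts to an (n1, n2) matrix.
    dlon = lon2[None, :] - lon1[:, None]
    dlat = lat2[None, :] - lat1[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1)[:, None] * np.cos(lat2)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Example data from the question; varA is assumed to be in km.
lon_a = [20.11, 20.87, 18.99]
lat_a = [19.88, 18.65, 20.75]
var_a = np.array([100, 90, 120])
lon_b = [18.75, 19.77, 20.86, 17.55]
lat_b = [20.77, 22.56, 23.76, 20.74]

dist_km = haversine_matrix(lon_a, lat_a, lon_b, lat_b)

# Count, for each dataset 1 place, the dataset 2 places closer than its varA.
counts = (dist_km < var_a[:, None]).sum(axis=1)
print(counts)
```

The full matrix needs n1 * n2 floats, so for very large datasets it may be necessary to process dataset 1 in chunks of rows and concatenate the counts.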

Update:

To project your coordinates, you need to convert from WGS84 (lon/lat) to an appropriate UTM zone. To find out which UTM zone you should project to, use epsg.io.

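import pyproj

lon = -122.67598
lat = 45.52168
WGS84 = "+init=EPSG:4326"
# "+init=" is the older PROJ string syntax; recent pyproj versions warn about it and also accept "EPSG:3740"
EPSG3740 = "+init=EPSG:3740"
Proj_to_EPSG3740 = pyproj.Proj(EPSG3740)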

Proj_to_EPSG3740(lon,lat)
# out: (525304.9265963673, 5040956.147893889)

You can use df.apply() with Proj_to_... to project the coordinates in df.
