Efficiently find neighbors on multiple dimensions and calculate sum of values based on proximity

Problem description

I am tasked with finding the total value of all elements within a variable distance of a central element. The elements are arranged using 3 dimensions (columns in my data). Each element has a unique location given the 3 dimensions (and has a unique-id).

I have a working version that does what I want, however it is terribly slow. I am using itertuples, finding the value per tuple using a subset dataframe, apply(np.isclose), and I set the value with .at (see code below).

The problem is not so much the function of my code as its scalability. Since I want to set a variable distance to measure, and I want to calculate this value for each row, it ends up iterating nrows x ndistances times, and currently each iteration takes 1.7 seconds (my data has >25,000 rows, so I estimate about 12 hours for each distance I try).

import pandas as pd
import numpy as np

Example data structure:

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19],
                   'x':[-2,-2,-2,-1,-1,-1,-1,0,0,0,0,0,1,1,1,1,2,2,2],
                   'y':[2,1,0,2,1,0,-1,2,1,0,-1,-2,1,0,-1,-2,0,-1,-2],
                   'z':[0,1,2,-1,0,1,2,-2,-1,0,1,2,-2,-1,0,1,-2,-1,0],
                   'val':[0,0,0,1,0,0,6,3,7,11,0,0,14,18,10,4,20,15,2]})
df.set_index('id', inplace=True)
# The 'val' column can have any non-negative whole number, I've just picked some randomly.

'Working' code so far:

n = 0  #Initial distance
while n < 3:  #This part allows me to set my distance range
    df['n{0}'.format(n)] = np.nan  #create a column for the new values
    for row in df.itertuples():
        valsum = df[(df['x'].apply(np.isclose, b=row.x, atol=n)) & 
                    (df['y'].apply(np.isclose, b=row.y, atol=n)) & 
                    (df['z'].apply(np.isclose, b=row.z, atol=n))].val.sum()
        df.at[row.Index, 'n{0}'.format(n)] = valsum
    n += 1

Current/desired output:

    x   y   z   val n0  n1  n2
id                          
1   -2  2   0   0   0   1   22
2   -2  1   1   0   0   0   25
3   -2  0   2   0   0   6   17
4   -1  2   -1  1   1   11  54
5   -1  1   0   0   0   19  70
6   -1  0   1   0   0   17  57
7   -1  -1  2   6   6   6   31
8   0   2   -2  3   3   25  74
9   0   1   -1  7   7   54  99
10  0   0   0   11  11  46  111
11  0   -1  1   0   0   31  73
12  0   -2  2   0   0   10  33
13  1   1   -2  14  14  62  99
14  1   0   -1  18  18  95  105
15  1   -1  0   10  10  60  107
16  1   -2  1   4   4   16  66
17  2   0   -2  20  20  67  100
18  2   -1  -1  15  15  65  101
19  2   -2  0   2   2   31  80

I know that the 'n0' column equals the 'val' column, because the search distance is 0, but I included it to show what I am looking for. The sum of all the items in the val column is 111, which is the same as the 'n2' value at (x,y,z) = (0,0,0). This is because (0,0,0) is the center of my data in this example, so a distance of 2 captures all of the elements. I'd like to do this for a range of distances, say, 5-10.

My ultimate question is: How can I do this but faster / more efficiently?

Recommended answer

Finding nearest neighbours within k-dimensional space is a classic case for the k-d tree data structure (Wikipedia). Scikit-learn has a flexible implementation (docs), which I use below, since the conditional logic used in your question seems to define the Chebyshev distance metric (Wikipedia), which scikit-learn supports natively. SciPy's cKDTree (docs, C++ source code) is an alternative: its queries default to the Euclidean (L2) metric but accept a Minkowski p parameter, where p=np.inf gives the Chebyshev distance, and it might be faster.
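To see why the three np.isclose conditions in the question amount to a Chebyshev radius search, here is a minimal brute-force sketch in NumPy. The function name chebyshev_sums is made up for illustration and is not part of any library. It builds a full n x n distance matrix, so it is only meant as a check on small inputs such as the example data, not as a solution for 25,000 rows.

```python
import numpy as np

# Coordinates and values copied from the question's example data.
x = [-2, -2, -2, -1, -1, -1, -1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y = [2, 1, 0, 2, 1, 0, -1, 2, 1, 0, -1, -2, 1, 0, -1, -2, 0, -1, -2]
z = [0, 1, 2, -1, 0, 1, 2, -2, -1, 0, 1, 2, -2, -1, 0, 1, -2, -1, 0]
val = np.array([0, 0, 0, 1, 0, 0, 6, 3, 7, 11, 0, 0, 14, 18, 10, 4, 20, 15, 2])
xyz = np.column_stack([x, y, z])


def chebyshev_sums(xyz, val, radius):
    """Sum `val` over all points within Chebyshev distance `radius` of each point.

    Hypothetical helper for illustration only; memory use grows as n squared.
    """
    # Per-axis absolute differences between every pair of points: shape (n, n, 3).
    diff = np.abs(xyz[:, None, :] - xyz[None, :, :])
    # Chebyshev distance is the largest per-axis difference: shape (n, n).
    dist = diff.max(axis=2)
    # Row i of the boolean mask marks the neighbours of point i (itself included).
    return (dist <= radius) @ val


n1 = chebyshev_sums(xyz, val, 1)
n2 = chebyshev_sums(xyz, val, 2)
```

A point lies inside the question's box exactly when its largest per-axis difference is at most the radius, which is the definition of the Chebyshev distance; a k-d tree answers the same query without comparing every pair of points.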

# Setup
import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19], 
                   'x':[-2,-2,-2,-1,-1,-1,-1,0,0,0,0,0,1,1,1,1,2,2,2], 
                   'y':[2,1,0,2,1,0,-1,2,1,0,-1,-2,1,0,-1,-2,0,-1,-2], 
                   'z':[0,1,2,-1,0,1,2,-2,-1,0,1,2,-2,-1,0,1,-2,-1,0], 
                   'val':[0,0,0,1,0,0,6,3,7,11,0,0,14,18,10,4,20,15,2]})
df.set_index('id', inplace=True)


from sklearn.neighbors import KDTree

# Build k-d tree with the Chebyshev metric, AKA L-infinity
tree = KDTree(df[['x', 'y', 'z']].values, metric='chebyshev')

for radius in [0, 1, 2]:
    # Populate new column with placeholder integer
    df[f'n{radius}'] = -1
    for i, row in df.iterrows():
        coords = row[['x', 'y', 'z']].values.reshape(1, -1)
        idx = tree.query_radius(coords, r=radius)[0]
        df.loc[i, f'n{radius}'] = df.iloc[idx]['val'].sum()

df
    x  y  z  val  n0  n1   n2
id                           
1  -2  2  0    0   0   1   22
2  -2  1  1    0   0   0   25
3  -2  0  2    0   0   6   17
4  -1  2 -1    1   1  11   54
5  -1  1  0    0   0  19   70
6  -1  0  1    0   0  17   57
7  -1 -1  2    6   6   6   31
8   0  2 -2    3   3  25   74
9   0  1 -1    7   7  54   99
10  0  0  0   11  11  46  111
11  0 -1  1    0   0  31   73
12  0 -2  2    0   0  10   33
13  1  1 -2   14  14  62   99
14  1  0 -1   18  18  95  105
15  1 -1  0   10  10  60  107
16  1 -2  1    4   4  16   66
17  2  0 -2   20  20  67  100
18  2 -1 -1   15  15  65  101
19  2 -2  0    2   2  31   80
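The row-by-row loop above can probably be shortened further, because query_radius accepts an array of query points and returns one index array per point. The following is a sketch of that variant on the same example data, not a benchmarked solution; the variable names coords, vals and neighbours are my own.

```python
import pandas as pd
from sklearn.neighbors import KDTree

df = pd.DataFrame({'id': list(range(1, 20)),
                   'x': [-2, -2, -2, -1, -1, -1, -1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2],
                   'y': [2, 1, 0, 2, 1, 0, -1, 2, 1, 0, -1, -2, 1, 0, -1, -2, 0, -1, -2],
                   'z': [0, 1, 2, -1, 0, 1, 2, -2, -1, 0, 1, 2, -2, -1, 0, 1, -2, -1, 0],
                   'val': [0, 0, 0, 1, 0, 0, 6, 3, 7, 11, 0, 0, 14, 18, 10, 4, 20, 15, 2]})
df = df.set_index('id')

# Pull the data out of pandas once, so the loop works on plain NumPy arrays.
coords = df[['x', 'y', 'z']].to_numpy()
vals = df['val'].to_numpy()

# Build the tree once; it is reused for every radius.
tree = KDTree(coords, metric='chebyshev')

for radius in [0, 1, 2]:
    # A single call returns, for each point, the positional indices of all
    # points within `radius` of it (the point itself included).
    neighbours = tree.query_radius(coords, r=radius)
    df[f'n{radius}'] = [vals[idx].sum() for idx in neighbours]
```

This removes the per-row pandas indexing (iterrows, .loc, .iloc) from the loop, leaving one tree query per radius plus a list comprehension over the neighbour arrays.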
