numpy-查找矩阵乘法的最近邻居 [英] Numpy - Finding Nearest Neighbors of a Matrix Multiplication

查看:71
本文介绍了numpy-查找矩阵乘法的最近邻居的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含1000个128维特征的数据集,其形状为(1000,128).

I have a dataset of a thousand 128 dimensional features in the shape of e.g. (1000,128).

我想找到形状为(128,1)的128维特征的排序最近的邻居.

I want to find the sorted nearest neighbors of a 128 dimensional feature in the shape of (128,1).

通过数据集(1000,128)与要素(128,1)之间的矩阵乘法计算的距离,将得出形状为(1000,1)的相似数组:

The distance in calculated via a Matrix Multiplication between dataset (1000,128) and feature (128,1) which would give an array of similarities in the shape of (1000,1) :

这是通过以下方式完成的:

This is done via:

# features.shape=(1000,128) ; feature.shape=(128,1) ; similarities.shape=(1000,1)
similarities = features.dot(feature)

计算距离(相似度)后,我使用以下代码查找最近的邻居:

After calculating the distance (similarities), I'm finding the nearest neighbors using the code below:

# The n Nearest Neighbors Indexes (But Not Sorted)
nearest_neighbours_indexes_unsorted = np.argpartition(similarities, kth=-n)[-n:]

# The n Nearest Neighbors (But Not Sorted)
nearest_neighbours_similarities_unsorted = similarities[nearest_neighbours_indexes_unsorted]

# The Indexes of n Nearest Neighbors Sorted
nearest_neighbours_indexes_sorted = np.flip(nearest_neighbours_indexes_unsorted[np.argsort(nearest_neighbours_similarities_unsorted)], axis=0)

此代码可非常快速地处理数百万个数据(我想知道是否有人提出使它更快的技巧),但我希望能够一次找到多个功能的最近邻居:

This code works very fast for millions of data (I'm interested if someone has a tip to make it faster) But I want to be able to find the nearest neighbors of more than one feature in one go:

一种方法是为循环中的每个功能计算上述代码(这很慢),另一种方法是更改​​代码以适应多维索引,这就是我遇到的问题:我不知道如何将上面的代码编写为形状为(128,n)而不是(128,1)的特征.

One way is to calculate the above code for each feature in a loop (which is slow) and the other way is to change the code to accommodate for multidimensional indexing and here's where I'm stuck: I don't know how to write the above code for features in the shape of (128,n) and not (128,1).

推荐答案

Helper函数可沿轴获取最大,最小的n索引元素

这是一个帮助函数,可以使用n-largest索引. argpartition.html"rel =" nofollow noreferrer> np.argpartition np.take_along_axis -

Helper functions to get largest, smallest n-indices, elements along an axis

Here's a helper function to select top n-largest indices along a generic axis from a generic ndarray making use of np.argpartition and np.take_along_axis -

def take_largest_indices_along_axis(ar, n, axis):    
    s = ar.ndim*[slice(None,None,None)]
    s[axis] = slice(-n,None,None)
    idx = np.argpartition(ar, kth=-n, axis=axis)[tuple(s)]
    sidx = np.take_along_axis(ar,idx, axis=axis).argsort(axis=axis)
    return np.flip(np.take_along_axis(idx, sidx, axis=axis),axis=axis)

将其扩展以获得n个最小的索引-

Extending this to get n-smallest indices -

def take_smallest_indices_along_axis(ar, n, axis):    
    s = ar.ndim*[slice(None,None,None)]
    s[axis] = slice(None,n,None)
    idx = np.argpartition(ar, kth=n, axis=axis)[tuple(s)]
    sidx = np.take_along_axis(ar,idx, axis=axis).argsort(axis=axis)
    return np.take_along_axis(idx, sidx, axis=axis)

并扩展这些元素以选择最大或最小的n元素本身,只需使用np.take_along_axis即可,如下所示-

And extending these to select the largest or smallest n elements themselves, it would be with a simple usage of np.take_along_axis as listed next -

def take_largest_along_axis(ar, n, axis):
    idx = take_largest_indices_along_axis(ar, n, axis)
    return np.take_along_axis(ar, idx, axis=axis)

def take_smallest_along_axis(ar, n, axis):
    idx = take_smallest_indices_along_axis(ar, n, axis)
    return np.take_along_axis(ar, idx, axis=axis)

示例运行

# Sample setup
In [200]: np.random.seed(0)
     ...: ar = np.random.randint(0,99,(5,5))

In [201]: ar
Out[201]: 
array([[44, 47, 64, 67, 67],
       [ 9, 83, 21, 36, 87],
       [70, 88, 88, 12, 58],
       [65, 39, 87, 46, 88],
       [81, 37, 25, 77, 72]])

采用最大的n索引,即沿轴的元素-

Take largest n indices, elements along axis -

In [202]: take_largest_indices_along_axis(ar, n=2, axis=0)
Out[202]: 
array([[4, 2, 2, 4, 3],
       [2, 1, 3, 0, 1]])

In [203]: take_largest_indices_along_axis(ar, n=2, axis=1)
Out[203]: 
array([[4, 3],
       [4, 1],
       [2, 1],
       [4, 2],
       [0, 3]])

In [251]: take_largest_along_axis(ar, n=2, axis=0)
Out[251]: 
array([[81, 88, 88, 77, 88],
       [70, 83, 87, 67, 87]])

In [252]: take_largest_along_axis(ar, n=2, axis=1)
Out[252]: 
array([[67, 67],
       [87, 83],
       [88, 88],
       [88, 87],
       [81, 77]])

采用最小的n索引,即沿轴的元素-

Take smallest n indices, elements along axis -

In [232]: take_smallest_indices_along_axis(ar, n=2, axis=0)
Out[232]: 
array([[1, 4, 1, 2, 2],
       [0, 3, 4, 1, 0]])

In [233]: take_smallest_indices_along_axis(ar, n=2, axis=1)
Out[233]: 
array([[0, 1],
       [0, 2],
       [3, 4],
       [1, 3],
       [2, 1]])

In [253]: take_smallest_along_axis(ar, n=2, axis=0)
Out[253]: 
array([[ 9, 37, 21, 12, 58],
       [44, 39, 25, 36, 67]])

In [254]: take_smallest_along_axis(ar, n=2, axis=1)
Out[254]: 
array([[44, 47],
       [ 9, 21],
       [12, 58],
       [39, 46],
       [25, 37]])


在这里解决问题

对于我们的情况,我们假设输入为similarities且其形状为(1000,128)表示1000个数据点和128个特征,并且我们想为每个数据点寻找最大的n=10特征,然后会是-


Solving our case here

For our case, let's assume the input is similarities and is of shape (1000,128) representing 1000 data points and 128 features and that we want to look for largest say n=10 features for each of those data points, then it would be -

take_largest_indices_along_axis(similarities, n=10, axis=1) # indices
take_largest_along_axis(similarities, n=10, axis=1) # elements

最终索引/值数组的形状为(1000, n).

The final indices/values array would be of shape (1000, n).

以给定的数据集形状运行样本-

Sample run with the given dataset shape -

In [257]: np.random.seed(0)
     ...: similarities = np.random.randint(0,99,(1000,128))

In [263]: take_largest_indices_along_axis(similarities, n=10, axis=1).shape
Out[263]: (1000, 10)

In [264]: take_largest_along_axis(similarities, n=10, axis=1).shape
Out[264]: (1000, 10)

如果相反,您希望获得每个功能的最大n个数据点,即最终索引/值数组的形状为(n, 128),则使用axis=0.

If instead you were looking to get n largest data-points for each of those features, that is the final indices/values array would be of shape (n, 128), then use axis=0.

这篇关于numpy-查找矩阵乘法的最近邻居的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆