How to do efficient k-nearest neighbor calculation in Matlab


Problem description


I'm doing data analysis using the k-nearest neighbor algorithm in Matlab. My data is an 11795 x 88 matrix, where the rows are observations and the columns are variables.

My task is to find k-nearest neighbors for n selected test points. Currently I'm doing it with the following logic:

FOR all the test points

   LOOP all the data and find the k-closest neighbors (by euclidean distance)

In other words, I loop over all n test points. For each test point I search the data (which excludes the test point itself) for the k nearest neighbors by Euclidean distance. For each test point this takes approximately k x 11794 iterations, so the whole process takes about n x k x 11794 iterations. If n = 10000 and k = 7, this would be approximately 825.6 million iterations.

Is there a more efficient way to calculate the k-nearest neighbors? Most of the computation is going to waste now, because my algorithm simply:

calculates the Euclidean distance to all the other points, picks the closest and excludes it from further consideration --> calculates the Euclidean distance to all the remaining points and picks the closest --> and so on, k times.

Is there a smart way to get rid of this 'waste calculation'?

Currently this process takes about 7 hours on my computer (3.2 GHz, 8 GB RAM, 64-bit Win 7)... :(

Here is some of the logic illustrated explicitly (this is not all my code, but this is the part that eats up performance):

for i = 1:size(testpoints, 1) % Loop all the test points 
    neighborcandidates = all_data_excluding_testpoints; % Use the rest of the data excluding the test points in search of the k-nearest neighbors 
    testpoint = testpoints(i, :); % This is the test point for which we find k-nearest neighbors
    kneighbors = []; % Store the k-nearest neighbors here.
    for j = 1:k % Find k-nearest neighbors
        bdist = Inf; % The distance of the closest neighbor
        bind = 0; % The index of the closest neighbor
        for n = 1:size(neighborcandidates, 1) % Loop all the candidates
            if pdist([testpoint; neighborcandidates(n, :)]) < bdist % Check the euclidean distance
                bdist = pdist([testpoint; neighborcandidates(n, :)]); % Update the best distance so far
                bind = n; % Save the best found index so far
            end
        end
        kneighbors = [kneighbors; neighborcandidates(bind, :)]; % Save the found neighbour
        neighborcandidates(bind, :) = []; % Remove the neighbor from further consideration 
    end
end
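For comparison outside Matlab, the same brute-force logic can be written as a minimal Python/NumPy sketch (array and function names here are hypothetical, not from the question); it makes the n x k x 11794 pass structure explicit, since each of the k neighbors costs one full scan of the remaining candidates:

```python
import numpy as np

def knn_naive(testpoints, data, k):
    """Naive k-NN: for each test point, repeatedly scan all
    remaining candidates for the closest one, save it, and
    remove it from consideration (k full passes per point)."""
    out = []
    for tp in testpoints:
        candidates = data.copy()
        neighbors = []
        for _ in range(k):
            # One full pass over the remaining candidates per neighbor.
            d = np.sqrt(((candidates - tp) ** 2).sum(axis=1))
            best = int(np.argmin(d))
            neighbors.append(candidates[best])
            candidates = np.delete(candidates, best, axis=0)
        out.append(np.array(neighbors))
    return out
```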

Solution

Using pdist2:

A = rand(20,5);             %// This is your 11795 x 88
B = A([1, 12, 4, 8], :);    %// This is your n-by-88 subset, i.e. n=4 in this case
n = size(B,1);

D = pdist2(A,B);
[~, ind] = sort(D);
kneighbours = ind(2:k+1, :);    %// k rows; row 1 is skipped because each test point's nearest neighbour is itself

Now you can use kneighbours to index rows in A. Note that the columns of kneighbours correspond to the rows of B.
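The same vectorised idea, sketched in Python with SciPy's cdist (array sizes mirror the toy example above; this is an illustration, not the poster's code): one call computes the whole distance matrix, and one sort per column replaces the inner loops entirely.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.random.rand(20, 5)        # full data set (stand-in for the 11795 x 88 matrix)
B = A[[0, 11, 3, 7], :]          # test points drawn from A, so n = 4 here
k = 3

D = cdist(A, B)                  # all pairwise distances at once, shape (20, 4)
ind = np.argsort(D, axis=0)      # each column: data indices sorted by distance to one test point
kneighbours = ind[1:k+1, :]      # skip row 0: each test point's nearest point is itself
```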

But since you're already dipping into the Statistics Toolbox with pdist, why not just use Matlab's knnsearch?

kneighbours_matlab = knnsearch(A,B,'K',k+1);

Note that kneighbours is the same as kneighbours_matlab(:, 2:end)'.
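Outside Matlab, the ready-made counterpart of knnsearch would be a spatial index such as SciPy's cKDTree (a sketch with the same toy sizes as above, not part of the original answer): the tree is built once and then answers all n queries without scanning every point.

```python
import numpy as np
from scipy.spatial import cKDTree

A = np.random.rand(20, 5)           # full data set (stand-in for the 11795 x 88 matrix)
B = A[[0, 11, 3, 7], :]             # test points taken from A, so n = 4 here
k = 3

tree = cKDTree(A)                   # build the index once...
dist, idx = tree.query(B, k=k + 1)  # ...then query all test points in one call
kneighbours_tree = idx[:, 1:]       # drop column 0: each query point finds itself first
```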
