寻找K-最近邻及其实现 [英] Finding K-nearest neighbors and its implementation

查看:22
本文介绍了寻找K-最近邻及其实现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用具有欧几里德距离的 KNN 对简单数据进行分类.我已经看到了一个例子,说明我想要做的是使用 MATLAB knnsearch 函数完成的,如下所示:

I am working on classifying simple data using KNN with Euclidean distance. I have seen an example on what I would like to do that is done with the MATLAB knnsearch function as shown below:

load fisheriris 
x = meas(:,3:4);
gscatter(x(:,1),x(:,2),species)
newpoint = [5 1.45];
[n,d] = knnsearch(x,newpoint,'k',10);
line(x(n,1),x(n,2),'color',[.5 .5 .5],'marker','o','linestyle','none','markersize',10)

上面的代码采用一个新点,即 [5 1.45] 并找到与新点最接近的 10 个值.谁能给我看一个 MATLAB 算法,详细解释 knnsearch 函数的作用?有没有其他方法可以做到这一点?

The above code takes a new point i.e. [5 1.45] and finds the 10 closest values to the new point. Can anyone please show me a MATLAB algorithm with a detailed explanation of what the knnsearch function does? Is there any other way to do this?

推荐答案

K-最近邻 (KNN) 算法的基础是您有一个由 N 行和 M 列,其中 N 是我们拥有的数据点的数量,而 M 是每个数据点的维度.例如,如果我们将笛卡尔坐标放在一个数据矩阵中,这通常是一个 N x 2 或一个 N x 3 矩阵.使用此数据矩阵,您提供一个查询点并在此数据矩阵中搜索最接近此查询点的 k 个点.

The basis of the K-Nearest Neighbour (KNN) algorithm is that you have a data matrix that consists of N rows and M columns where N is the number of data points that we have, while M is the dimensionality of each data point. For example, if we placed Cartesian co-ordinates inside a data matrix, this is usually a N x 2 or a N x 3 matrix. With this data matrix, you provide a query point and you search for the closest k points within this data matrix that are the closest to this query point.

我们通常使用查询与数据矩阵中其余点之间的欧几里德距离来计算我们的距离.但是,也使用其他距离,如 L1 或城市街区/曼哈顿距离.在此操作之后,您将拥有 N 欧几里得或曼哈顿距离,它们象征着查询与数据集中每个对应点之间的距离.找到这些后,您只需通过按升序对距离进行排序并检索数据之间距离最小的那些 k 个点来搜索距查询最近的 k 个点设置和查询.

We usually use the Euclidean distance between the query and the rest of your points in your data matrix to calculate our distances. However, other distances like the L1 or the City-Block / Manhattan distance are also used. After this operation, you will have N Euclidean or Manhattan distances which symbolize the distances between the query with each corresponding point in the data set. Once you find these, you simply search for the k nearest points to the query by sorting the distances in ascending order and retrieving those k points that have the smallest distance between your data set and the query.

假设您的数据矩阵存储在 x 中,并且 newpoint 是一个样本点,它具有 M 列(即 1x M),这是点形式的一般程序:

Supposing your data matrix was stored in x, and newpoint is a sample point where it has M columns (i.e. 1 x M), this is the general procedure you would follow in point form:

  1. newpointx 中每个点之间的欧几里得或曼哈顿距离.
  2. 按升序对这些距离进行排序.
  3. 返回x中最接近newpointk个数据点.
  1. Find the Euclidean or Manhattan distance between newpoint and every point in x.
  2. Sort these distances in ascending order.
  3. Return the k data points in x that are closest to newpoint.

让我们慢慢地做每一步.

Let's do each step slowly.

某人可能会这样做的一种方式可能是在 for 循环中,如下所示:

One way that someone may do this is perhaps in a for loop like so:

N = size(x,1);
dists = zeros(N,1);
for idx = 1 : N
    dists(idx) = sqrt(sum((x(idx,:) - newpoint).^2));
end

如果你想实现曼哈顿距离,这将是:

If you wanted to implement the Manhattan distance, this would simply be:

N = size(x,1);
dists = zeros(N,1);
for idx = 1 : N
    dists(idx) = sum(abs(x(idx,:) - newpoint));
end

dists 将是一个 N 元素向量,包含 xnewpoint 中每个数据点之间的距离.我们在 newpointx 中的一个数据点之间做一个逐个元素的减法,平方差,然后 sum 将它们放在一起.这个和然后被平方根,这完成了欧几里得距离.对于曼哈顿距离,您将逐个元素执行减法,取绝对值,然后将所有分量相加.这可能是最容易理解的实现,但它可能是最低效的……尤其是对于较大的数据集和较大的数据维度.

dists would be a N element vector that contains the distances between each data point in x and newpoint. We do an element-by-element subtraction between newpoint and a data point in x, square the differences, then sum them all together. This sum is then square rooted, which completes the Euclidean distance. For the Manhattan distance, you would perform an element by element subtraction, take the absolute values, then sum all of the components together. This is probably the most simplest of the implementations to understand, but it could possibly be the most inefficient... especially for larger sized data sets and larger dimensionality of your data.

另一种可能的解决方案是复制 newpoint 并使该矩阵与 x 的大小相同,然后对该矩阵进行逐个元素的减法,然后求和在每行的所有列上并做平方根.因此,我们可以这样做:

Another possible solution would be to replicate newpoint and make this matrix the same size as x, then doing an element-by-element subtraction of this matrix, then summing over all of the columns for each row and doing the square root. Therefore, we can do something like this:

N = size(x, 1);
dists = sqrt(sum((x - repmat(newpoint, N, 1)).^2, 2));

对于曼哈顿距离,你会这样做:

For the Manhattan distance, you would do:

N = size(x, 1);
dists = sum(abs(x - repmat(newpoint, N, 1)), 2);

repmat 需要一个矩阵或向量并在给定方向重复它们一定次数.在我们的例子中,我们想要获取我们的 newpoint 向量,并将这个 N 次堆叠在一起以创建一个 N x M 矩阵,其中每一行的长度为 M 个元素.我们将这两个矩阵相减,然后对每个分量求平方.一旦我们这样做了,我们对每一行的所有列进行sum,最后取所有结果的平方根.对于曼哈顿距离,我们做减法,取绝对值,然后求和.

repmat takes a matrix or vector and repeats them a certain amount of times in a given direction. In our case, we want to take our newpoint vector, and stack this N times on top of each other to create a N x M matrix, where each row is M elements long. We subtract these two matrices together, then square each component. Once we do this, we sum over all of the columns for each row and finally take the square root of all result. For the Manhattan distance, we do the subtraction, take the absolute value and then sum.

但是,在我看来,最有效的方法是使用 bsxfun.这基本上通过单个函数调用完成了我们在后台讨论的复制.因此,代码将是这样的:

However, the most efficient way to do this in my opinion would be to use bsxfun. This essentially does the replication that we talked about under the hood with a single function call. Therefore, the code would simply be this:

dists = sqrt(sum(bsxfun(@minus, x, newpoint).^2, 2));

对我来说,这看起来更简洁明了.对于曼哈顿距离,你会这样做:

To me this looks much cleaner and to the point. For the Manhattan distance, you would do:

dists = sum(abs(bsxfun(@minus, x, newpoint)), 2);


步骤#2

既然我们有了距离,我们只需对它们进行排序.我们可以使用 sort 对我们的距离:


Step #2

Now that we have our distances, we simply sort them. We can use sort to sort our distances:

[d,ind] = sort(dists);

d 将包含按升序排序的距离,而 ind 告诉您 unsorted 数组中出现在排序结果.我们需要使用ind,提取这个向量的前k个元素,然后使用ind索引到我们的x 返回那些最接近 newpoint 的点的数据矩阵.

d would contain the distances sorted in ascending order, while ind tells you for each value in the unsorted array where it appears in the sorted result. We need to use ind, extract the first k elements of this vector, then use ind to index into our x data matrix to return those points that were the closest to newpoint.

最后一步是现在返回那些最接近 newpointk 数据点.我们可以非常简单地做到这一点:

The final step is to now return those k data points that are closest to newpoint. We can do this very simply by:

ind_closest = ind(1:k);
x_closest = x(ind_closest,:);

ind_closest 应该包含原始数据矩阵 x 中最接近 newpoint 的索引.具体来说,ind_closest 包含您需要从 x 中采样的,以获得最接近 newpoint 的点.x_closest 将包含那些实际数据点.

ind_closest should contain the indices in the original data matrix x that are the closest to newpoint. Specifically, ind_closest contains which rows you need to sample from in x to obtain the closest points to newpoint. x_closest will contain those actual data points.

为了您复制和粘贴的乐趣,代码如下所示:

For your copying and pasting pleasure, this is what the code looks like:

dists = sqrt(sum(bsxfun(@minus, x, newpoint).^2, 2));
%// Or do this for Manhattan
% dists = sum(abs(bsxfun(@minus, x, newpoint)), 2);
[d,ind] = sort(dists);
ind_closest = ind(1:k);
x_closest = x(ind_closest,:);


运行您的示例,让我们看看我们的代码:


Running through your example, let's see our code in action:

load fisheriris 
x = meas(:,3:4);
newpoint = [5 1.45];
k = 10;

%// Use Euclidean
dists = sqrt(sum(bsxfun(@minus, x, newpoint).^2, 2));
[d,ind] = sort(dists);
ind_closest = ind(1:k);
x_closest = x(ind_closest,:);

通过检查 ind_closestx_closest,我们得到:

By inspecting ind_closest and x_closest, this is what we get:

>> ind_closest

ind_closest =

   120
    53
    73
   134
    84
    77
    78
    51
    64
    87

>> x_closest

x_closest =

    5.0000    1.5000
    4.9000    1.5000
    4.9000    1.5000
    5.1000    1.5000
    5.1000    1.6000
    4.8000    1.4000
    5.0000    1.7000
    4.7000    1.4000
    4.7000    1.4000
    4.7000    1.5000

如果您运行 knnsearch,您将看到您的变量 nind_closest 匹配.但是,变量d 返回从newpoint 到每个点x距离,而不是实际数据点本身.如果你想要实际距离,只需在我写的代码之后执行以下操作:

If you ran knnsearch, you will see that your variable n matches up with ind_closest. However, the variable d returns the distances from newpoint to each point x, not the actual data points themselves. If you want the actual distances, simply do the following after the code I wrote:

dist_sorted = d(1:k);


请注意,上述答案仅使用了一批 N 示例中的一个查询点.KNN 经常同时用于多个示例.假设我们有想要在 KNN 中测试的 Q 查询点.这将导致一个 kx M x Q 矩阵,其中对于每个示例或每个切片,我们返回 k 个维度为 M 的最近点.或者,我们可以返回 k 个最近点的 ID,从而得到一个 Q x k 矩阵.让我们计算两者.


Note that the above answer uses only one query point in a batch of N examples. Very frequently KNN is used on multiple examples simultaneously. Supposing that we have Q query points that we want to test in the KNN. This would result in a k x M x Q matrix where for each example or each slice, we return the k closest points with a dimensionality of M. Alternatively, we can return the IDs of the k closest points thus resulting in a Q x k matrix. Let's compute both.

一种简单的方法是将上述代码应用到一个循环中并遍历每个示例.

A naive way to do this would be to apply the above code in a loop and loop over every example.

当我们分配一个 Q xk 矩阵并应用基于 bsxfun 的方法将输出矩阵的每一行设置为 k 数据集中最近的点,我们将在这里像以前一样使用 Fisher Iris 数据集.我们还将保持与前一个示例中相同的维度,我将使用四个示例,因此 Q = 4M = 2:

Something like this would work where we allocate a Q x k matrix and apply the bsxfun based approach to set each row of the output matrix to the k closest points in the dataset, where we will use the Fisher Iris dataset just like what we had before. We'll also keep the same dimensionality as we did in the previous example and I'll use four examples, so Q = 4 and M = 2:

%// Load the data and create the query points
load fisheriris;
x = meas(:,3:4);
newpoints = [5 1.45; 7 2; 4 2.5; 2 3.5];

%// Define k and the output matrices
Q = size(newpoints, 1);
M = size(x, 2);
k = 10;
x_closest = zeros(k, M, Q);
ind_closest = zeros(Q, k);

%// Loop through each point and do logic as seen above:
for ii = 1 : Q
    %// Get the point
    newpoint = newpoints(ii, :);

    %// Use Euclidean
    dists = sqrt(sum(bsxfun(@minus, x, newpoint).^2, 2));
    [d,ind] = sort(dists);

    %// New - Output the IDs of the match as well as the points themselves
    ind_closest(ii, :) = ind(1 : k).';
    x_closest(:, :, ii) = x(ind_closest(ii, :), :);
end

虽然这很好,但我们可以做得更好.有一种方法可以有效地计算两组向量之间的平方欧几里得距离.如果你想在曼哈顿做到这一点,我会把它留作练习.咨询 这个博客,假设A 是一个Q1 x M 矩阵,其中每一行都是一个维度MQ1 点和 B 是一个 Q2 x M 矩阵,其中每一行也是一个 M 维数点Q2 点,我们可以有效地计算距离矩阵 D(i, j) 其中第 i 行和 j<列的元素/code> 使用以下矩阵公式表示 Ai 行和 Bj 行之间的距离:

Though this is very nice, we can do even better. There is a way to efficiently compute the squared Euclidean distance between two sets of vectors. I'll leave it as an exercise if you want to do this with the Manhattan. Consulting this blog, given that A is a Q1 x M matrix where each row is a point of dimensionality M with Q1 points and B is a Q2 x M matrix where each row is also a point of dimensionality M with Q2 points, we can efficiently compute a distance matrix D(i, j) where the element at row i and column j denotes the distance between row i of A and row j of B using the following matrix formulation:

nA = sum(A.^2, 2); %// Sum of squares for each row of A
nB = sum(B.^2, 2); %// Sum of squares for each row of B
D = bsxfun(@plus, nA, nB.') - 2*A*B.'; %// Compute distance matrix
D = sqrt(D); %// Compute square root to complete calculation

因此,如果我们让 A 为查询点矩阵,B 为原始数据组成的数据集,我们可以确定 k 通过对每一行单独排序并确定每一行中最小的 k 个位置来最近点.我们还可以另外使用它来检索实际点本身.

Therefore, if we let A be a matrix of query points and B be the dataset consisting of your original data, we can determine the k closest points by sorting each row individually and determining the k locations of each row that were the smallest. We can also additionally use this to retrieve the actual points themselves.

因此:

%// Load the data and create the query points
load fisheriris;
x = meas(:,3:4);
newpoints = [5 1.45; 7 2; 4 2.5; 2 3.5];

%// Define k and other variables
k = 10;
Q = size(newpoints, 1);
M = size(x, 2);

nA = sum(newpoints.^2, 2); %// Sum of squares for each row of A
nB = sum(x.^2, 2); %// Sum of squares for each row of B
D = bsxfun(@plus, nA, nB.') - 2*newpoints*x.'; %// Compute distance matrix
D = sqrt(D); %// Compute square root to complete calculation 

%// Sort the distances 
[d, ind] = sort(D, 2);

%// Get the indices of the closest distances
ind_closest = ind(:, 1:k);

%// Also get the nearest points
x_closest = permute(reshape(x(ind_closest(:), :).', M, k, []), [2 1 3]);

我们看到我们使用的计算距离矩阵的逻辑是相同的,但一些变量已更改以适应示例.我们还使用 sort 的两个输入版本对每一行进行独立排序,因此 ind 将包含每行的 ID,d 将包含相应的距离.然后我们通过简单地将该矩阵截断为 k 列,找出最接近每个查询点的索引.然后我们使用 permutereshape 以确定关联的最近点是什么.我们首先使用所有最接近的索引并创建一个点矩阵,该矩阵将所有 ID 堆叠在彼此的顶部,因此我们得到一个 Q * k x M 矩阵.使用 reshapepermute 允许我们创建我们的 3D 矩阵,使其成为我们指定的 k x M x Q 矩阵.如果您想自己获得实际距离,我们可以索引 d 并获取我们需要的内容.为此,您需要使用 sub2ind 获得线性索引,以便我们可以一次性索引到 d.ind_closest 的值已经给了我们需要访问的列.我们需要访问的行只是 1、k 次、2、k 次,等等,直到 Q.k 是我们想要返回的点数:

We see that we used the logic for computing the distance matrix is the same but some variables have changed to suit the example. We also sort each row independently using the two input version of sort and so ind will contain the IDs per row and d will contain the corresponding distances. We then figure out which indices are the closest to each query point by simply truncating this matrix to k columns. We then use permute and reshape to determine what the associated closest points are. We first use all of the closest indices and create a point matrix that stacks all of the IDs on top of each other so we get a Q * k x M matrix. Using reshape and permute allows us to create our 3D matrix so that it becomes a k x M x Q matrix like we have specified. If you wanted to get the actual distances themselves, we can index into d and grab what we need. To do this, you will need to use sub2ind to obtain the linear indices so we can index into d in one shot. The values of ind_closest already give us which columns we need to access. The rows we need to access are simply 1, k times, 2, k times, etc. up to Q. k is for the number of points we wanted to return:

row_indices = repmat((1:Q).', 1, k);
linear_ind = sub2ind(size(d), row_indices, ind_closest);
dist_sorted = D(linear_ind);

当我们对上面的查询点运行上面的代码时,这些是我们得到的索引、点和距离:

When we run the above code for the above query points, these are the indices, points and distances we get:

>> ind_closest

ind_closest =

   120   134    53    73    84    77    78    51    64    87
   123   119   118   106   132   108   131   136   126   110
   107    62    86   122    71   127   139   115    60    52
    99    65    58    94    60    61    80    44    54    72

>> x_closest

x_closest(:,:,1) =

    5.0000    1.5000
    6.7000    2.0000
    4.5000    1.7000
    3.0000    1.1000
    5.1000    1.5000
    6.9000    2.3000
    4.2000    1.5000
    3.6000    1.3000
    4.9000    1.5000
    6.7000    2.2000


x_closest(:,:,2) =

    4.5000    1.6000
    3.3000    1.0000
    4.9000    1.5000
    6.6000    2.1000
    4.9000    2.0000
    3.3000    1.0000
    5.1000    1.6000
    6.4000    2.0000
    4.8000    1.8000
    3.9000    1.4000


x_closest(:,:,3) =

    4.8000    1.4000
    6.3000    1.8000
    4.8000    1.8000
    3.5000    1.0000
    5.0000    1.7000
    6.1000    1.9000
    4.8000    1.8000
    3.5000    1.0000
    4.7000    1.4000
    6.1000    2.3000


x_closest(:,:,4) =

    5.1000    2.4000
    1.6000    0.6000
    4.7000    1.4000
    6.0000    1.8000
    3.9000    1.4000
    4.0000    1.3000
    4.7000    1.5000
    6.1000    2.5000
    4.5000    1.5000
    4.0000    1.3000

>> dist_sorted

dist_sorted =

    0.0500    0.1118    0.1118    0.1118    0.1803    0.2062    0.2500    0.3041    0.3041    0.3041
    0.3000    0.3162    0.3606    0.4123    0.6000    0.7280    0.9055    0.9487    1.0198    1.0296
    0.9434    1.0198    1.0296    1.0296    1.0630    1.0630    1.0630    1.1045    1.1045    1.1180
    2.6000    2.7203    2.8178    2.8178    2.8320    2.9155    2.9155    2.9275    2.9732    2.9732

为了与 knnsearch 进行比较,您将为第二个参数指定一个点矩阵,其中每一行是一个查询点,您将看到索引和排序距离在此实现和knnsearch.

To compare this with knnsearch, you would instead specify a matrix of points for the second parameter where each row is a query point and you will see that the indices and sorted distances match between this implementation and knnsearch.

希望对你有帮助.祝你好运!

Hope this helps you. Good luck!

这篇关于寻找K-最近邻及其实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆