在MATLAB中优化手动编码的k均值? [英] optimizing manually-coded k-means in MATLAB?

查看:62
本文介绍了在MATLAB中优化手动编码的k均值?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以我在MATLAB中编写了一个k-means脚本,因为本机函数似乎效率不高,而且似乎可以完全正常运行.它似乎适用于我正在使用的小型训练集(这是通过文本文件输入的150x2矩阵).但是,对于我的目标数据集(3924x19矩阵)而言,运行时间花费的时间成倍增加.

So I'm writing a k-means script in MATLAB, since the native function doesn't seem to be very efficient, and it seems to be fully operational. It appears to work on the small training set that I'm using (which is a 150x2 matrix fed via text file). However, the runtime is taking exponentially longer for my target data set, which is a 3924x19 matrix.

我在矢量化方面不是最出色的,因此任何建议将不胜感激.到目前为止,这是我的k-means脚本(我知道我将不得不调整收敛条件,因为它正在寻找精确匹配,因此对于如此大的数据集,我可能需要更多的迭代,但是我希望它能够能够先在合理的时间内完成操作,然后再提高该数字):

I'm not the greatest at vectorization, so any suggestions would be greatly appreciated. Here's my k-means script so far (I know I'm going to have to adjust my convergence condition, since it's looking for an exact match, and I'll probably need more iterations for a dataset this large, but I want it to be able to finish in a reasonable time first, before I crank that number up):

clear all;

%take input file (manually specified by user
disp('Please type input filename (in working directory): ')
target_file = input('filename: ', 's');

%parse and load into matrix
data = load(target_file);

%prompt name of output file for later) UNCOMMENT BELOW TWO LINES LATER
% disp('Please type output filename (to be saved in working directory): ')
% output_name = input('filename:', 's')

%prompt number of clusters
disp('Please type desired number of clusters: ')
c = input ('number of clusters: ');

%specify type of kmeans algorithm ('regular' for regular, 'fuzzy' for fuzzy)
%UNCOMMENT BELOW TWO LINES LATER
% disp('Please specify type (regular or fuzzy):')
% runtype = input('type: ', 's')

%initialize cluster centroid locations within bounds given by data set

%initialize rangemax and rangemin row vectors
%with length same as number of dimensions
rangemax = zeros(1,size(data,2)); 
rangemin = zeros(1,size(data,2));

%map max and min values for bounds
for dim = 1:size(data,2)
    rangemax(dim) = max(data(:,dim));
    rangemin(dim) = min(data(:,dim));
end

% rangemax
% rangemin

%randomly initialize mu_k (center) locations in (k x n) matrix where k is
%cluster number and n is number of dimensions/coordinates

mu_k = zeros(c,size(data,2));

for k = 1:size(data,2)
    mu_k(k,:) = rangemin + (rangemax - rangemin).*rand(1,1);
end

mu_k

%iterate k-means

%initialize holding variable for distance comparison
comparisonmatrix = [];

%initialize assignment vector
assignment = zeros(size(data,1),1);

%initialize distance holding vector
dist = zeros(1,size(data,2));

%specify convergence threshold
%threshold = 0.001;

for iteration = 1:25

    %save current assignment values to check convergence condition
    hold_assignment = assignment;

    for point = 1:size(data,1)

        %calculate distances from point to centers
        for k = 1:c
            %holding variables
            comparisonmatrix = [data(point,:);mu_k(k,:)];

            dist(k) = pdist(comparisonmatrix);
        end

        %record location of mininum distance (location value will be between 1
        %and k)
        [minval, location] = min(dist);

        %assign cluster number (analogous to location value)
        assignment(point) = location;

    end

    %check convergence criteria

    if isequal(assignment,hold_assignment)
        break
    end

    %revise mu_k locations

    %count number of each label
    assignment_count = zeros(1,c);

    for i = 1:size(data,1)
        assignment_count(assignment(i)) = assignment_count(assignment(i)) + 1;
    end

    %compute centroids
    point_total = zeros(size(mu_k));

    for row = 1:size(data,1)
        point_total(assignment(row),:) = point_total(assignment(row)) + data(row,:);
    end

    %move mu_k values to centroids
    for center = 1:c
        mu_k(center,:) = point_total(center,:)/assignment_count(center);
    end
end

里面有很多循环,所以我觉得需要做很多优化.但是,我想我只是盯着这段代码太久了,所以有些新鲜的眼睛可能会有所帮助.如果需要在代码块中澄清任何内容,请告诉我.

There are a lot of loops in there, so I feel that there's a lot of optimization to be made. However, I think I've just been staring at this code for far too long, so some fresh eyes could help. Please let me know if I need to clarify anything in the code block.

在大型数据集上执行上述代码块(在上下文中)时,根据MATLAB的探查器,它需要3732.152秒才能进行完整的25次迭代(根据我的假设,它尚未收敛")条件),但有150个簇,但其中约130个返回了NaN(mu_k中有130行).

When the above code block is executed (in context) on the large dataset, it takes 3732.152 seconds, according to MATLAB's profiler, to make the full 25 iterations (I'm assuming it hasn't "converged" according to my criteria yet) for 150 clusters, but about 130 of them return NaNs (130 rows in mu_k).

推荐答案

配置文件会有所帮助,但是重做代码的地方是避免数据点(for point = 1:size(data,1))数量上的循环.向量化.

Profiling will help, but the place to rework your code is to avoid the loop over the number of data points (for point = 1:size(data,1)). Vectorize that.

在您的for iteration循环中,这是一个快速的部分示例,

In your for iteration loop here is a quick partial example,

[nPoints,nDims] = size(data);

% Calculate all high-dimensional distances at once
kdiffs = bsxfun(@minus,data,permute(mu_k,[3 2 1])); % NxDx1 - 1xDxK => NxDxK
distances = sum(kdiffs.^2,2); % no need to do sqrt
distances = squeeze(distances); % Nx1xK => NxK

% Find closest cluster center for each point
[~,ik] = min(distances,[],2); % Nx1

% Calculate the new cluster centers (mean the data)
mu_k_new = zeros(c,nDims);
for i=1:c,
    indk = ik==i;
    clustersizes(i) = nnz(indk);
    mu_k_new(i,:) = mean(data(indk,:))';
end

这不是唯一(或最好)的方法,但它应该是一个不错的例子.

This isn't the only (or the best) way to do it, but it should be a decent example.

其他一些评论:

  1. 使此脚本成为有效处理输入参数的函数,而不是使用input.
  2. 如果您想要一种简单的方法来指定文件,请参见uigetfile.
  3. 对于许多MATLAB函数,例如maxminsummean等,您可以指定函数应在其上运行的尺寸.这样,您就可以在矩阵上运行它,并同时计算多个条件/维度的值.
  4. 一旦获得了不错的性能,请考虑进行更长的迭代,特别是直到中心不再变化或变化聚类的样本数量变小时为止.
  5. 每个点的最小距离ik的簇与平方欧几里德距离相同.
  1. Instead of using input, make this script into a function to efficiently handle input arguments.
  2. If you want an easy way to specify a file, see uigetfile.
  3. With many MATLAB functions, such as max, min, sum, mean, etc., you can specify a dimension over which the function should operate. This way you an run it on a matrix and compute values for multiple conditions/dimensions at the same time.
  4. Once you get decent performance, consider iterating longer, specifically until the centers no longer change or the number of samples that change clusters becomes small.
  5. The cluster with the smallest distance for each point, ik, will be the same with squared Euclidean distance.

这篇关于在MATLAB中优化手动编码的k均值?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆