How does this code for standardizing data work?

Problem description

I was given a standardize function for a machine learning course. It isn't well documented and I'm still new to MATLAB, so I'm trying to break the function down. Any explanation of the syntax or of the general idea of standardizing would help greatly. We use this function to standardize a set of training data provided in a large matrix. A breakdown of most of the lines of the code snippet would help me a lot. Thank you so much.

function [X, mean_X, std_X] = standardize(varargin)
switch nargin
    case 1
        mean_X = mean(varargin{1});
        std_X = std(varargin{1});

        X = varargin{1} - repmat(mean_X, [size(varargin{1}, 1) 1]);

        for i = 1:size(X, 2)
            X(:, i) =  X(:, i) / std(X(:, i));
        end     
    case 3
        mean_X = varargin{2};
        std_X = varargin{3};
        X = varargin{1} - repmat(mean_X, [size(varargin{1}, 1) 1]);
        for i = 1:size(X, 2)
            X(:, i) =  X(:, i) / std_X(:, i);
        end 
end

Solution

This code accepts a data matrix of size M x N and standardizes it one column at a time. MATLAB's mean and std work down the rows of a matrix, so each returns a 1 x N row vector holding one value per column. I'm reading each column as one data sample here, so M is the dimensionality of a sample and N is the total number of samples. If your course stores one sample per row instead (so that each column is one feature), the code behaves identically and simply standardizes each feature.

Now, the true purpose of this code is to take all of the columns of your matrix and standardize / normalize the data so that each column exhibits zero mean and unit variance. This means that after this transform, if you computed the mean of any column in this matrix, it would be 0 and the variance would be 1. This is a very standard method for normalizing values in statistical analysis, machine learning, and computer vision.
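To make the zero-mean, unit-variance claim concrete, here is a minimal sketch on a made-up 4 x 2 matrix. The values and the names A, mu, sigma and Z are assumptions for illustration and do not come from the course code. It applies the same subtract-then-divide steps inline and then checks the column statistics. Note that std divides by N-1 by default, and var does the same, so the two agree.

```matlab
% Hypothetical 4 x 2 data matrix; the values are made up for illustration.
A = [1 10; 2 20; 3 30; 4 40];

mu    = mean(A);   % 1 x 2 row vector: mean of each column
sigma = std(A);    % 1 x 2 row vector: std. dev. of each column (N-1 normalization)

% Same steps as the function: subtract the column means, then divide
% by the column standard deviations.
M = size(A, 1);
Z = (A - repmat(mu, [M 1])) ./ repmat(sigma, [M 1]);

disp(mean(Z));     % each entry is 0 up to floating-point round-off
disp(var(Z));      % each entry is 1 up to floating-point round-off
```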

This normalization actually comes from the z-score in statistical analysis. Specifically, the equation for normalization is:

$$z = \frac{x - \mu}{\sigma}$$

Given a set of data points, we subtract the mean of these data points ($\mu$) from the value in question ($x$), then divide by their standard deviation ($\sigma$). Here is how you'd call this code. Given this matrix, which we will call X, there are two ways you can call it:

  • Method #1: [X, mean_X, std_X] = standardize(X);
  • Method #2: [X, mean_X, std_X] = standardize(X, mu, sigma);

The first method automatically infers the mean and the standard deviation of each column of X. mean_X and std_X are both returned as 1 x N vectors that give you the mean and standard deviation of each column in the matrix X. The second method allows you to manually specify a mean (mu) and a standard deviation (sigma) for each column of X, both as 1 x N vectors. This could be for debugging, but a common reason is to reuse the statistics computed from your training data on other data, such as a test set. In this case, what is returned for mean_X and std_X is identical to mu and sigma.
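As a rough sketch of how the two call forms relate, the snippet below reproduces their arithmetic inline rather than calling standardize itself, so that it runs on its own. X_train and X_test are hypothetical matrices with the same number of columns; they are not part of the original question.

```matlab
% Hypothetical training and test matrices with the same number of columns.
X_train = [1 10; 2 20; 3 30; 4 40];
X_test  = [2 15; 5 35];

% What Method #1 computes: statistics inferred from the data itself.
mu    = mean(X_train);   % 1 x N row vector of column means
sigma = std(X_train);    % 1 x N row vector of column standard deviations

% What Method #2 computes: the supplied mu and sigma are applied to other data.
M_test = size(X_test, 1);
X_test_std = (X_test - repmat(mu, [M_test 1])) ./ repmat(sigma, [M_test 1]);

disp(size(mu));          % 1 x 2: one entry per column
disp(X_test_std);        % test data expressed in training-set z-scores
```

With the real function, the equivalent calls would be [X_train_std, mu, sigma] = standardize(X_train); followed by X_test_std = standardize(X_test, mu, sigma);.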

The code is a bit poorly written IMHO, because you can certainly achieve this in a vectorized way. The gist is that, if we are using Method #1, it finds the mean of every column of the matrix X, duplicates this vector so that it becomes an M x N matrix, and then subtracts this matrix from X. This subtracts from each column its respective mean. We also compute the standard deviation of each column before the mean subtraction.
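The duplication step can be seen in isolation with a small sketch. The 3 x 2 matrix A and the names mean_A, mean_rep and centered are made up for illustration.

```matlab
% Hypothetical 3 x 2 matrix, so M = 3 and N = 2.
A = [1 4; 2 5; 3 9];
mean_A = mean(A);                            % 1 x 2 row vector: [2 6]

% Stack M copies of the row vector vertically to get an M x N matrix.
mean_rep = repmat(mean_A, [size(A, 1) 1]);   % [2 6; 2 6; 2 6]

% Element-wise subtraction now removes each column's own mean.
centered = A - mean_rep;                     % [-1 -2; 0 -1; 1 3]

disp(mean_rep);
disp(centered);
```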

Once we do that, we normalize X by dividing each column by its respective standard deviation. BTW, writing std_X(:, i) is superfluous, as std_X is already a 1 x N vector. std_X(:, i) means to grab all of the rows at the ith column. With a 1 x N vector, this can simply be replaced with std_X(i); the original is a bit overkill for my taste.
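A quick way to convince yourself of the indexing point is the sketch below, where s is a made-up 1 x 3 vector standing in for std_X. The second half illustrates a related observation about the one-input case: its loop calls std(X(:, i)) again on the mean-subtracted column, which gives the same value as the std_X already computed, because subtracting a constant does not change the standard deviation.

```matlab
% Hypothetical 1 x 3 row vector standing in for std_X.
s = [0.5 2 4];
i = 2;

a = s(:, i);   % all rows (there is only one) of column i
b = s(i);      % linear indexing into the same element

disp(isequal(a, b));   % the two forms pick out the same value

% Subtracting the column means leaves the column standard deviations unchanged.
A = [1 4; 2 5; 3 9];                               % made-up data
centered = A - repmat(mean(A), [size(A, 1) 1]);
disp(max(abs(std(centered) - std(A))));            % zero up to round-off
```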

Method #2 performs the same thing as Method #1, but we provide our own mean and standard deviation for each column of X.

For the sake of documentation, this is how I would have commented the code:

function [X, mean_X, std_X] = standardize(varargin)
switch nargin %// Check how many input variables we have input into the function
    case 1 %// If only one variable - this is the input matrix
        mean_X = mean(varargin{1}); %// Find mean of each column
        std_X = std(varargin{1}); %// Find standard deviation of each column

        %// Take each column of X and subtract by its corresponding mean
        %// Take mean_X and duplicate M times vertically
        X = varargin{1} - repmat(mean_X, [size(varargin{1}, 1) 1]);

        %// Next, for each column, normalize by its respective standard deviation
        for i = 1:size(X, 2)
            X(:, i) =  X(:, i) / std(X(:, i));
        end     
    case 3 %// If we provide three inputs
        mean_X = varargin{2}; %// Second input is a mean vector
        std_X = varargin{3}; %// Third input is a standard deviation vector

        %// Apply the code as seen in the first case
        X = varargin{1} - repmat(mean_X, [size(varargin{1}, 1) 1]);
        for i = 1:size(X, 2)
            X(:, i) =  X(:, i) / std_X(:, i);
        end 
end


If I can suggest another way to write this code, I would use the mighty and powerful bsxfun function. bsxfun applies an element-wise operation between the matrix and a row vector and expands the vector across the rows internally, so we never have to duplicate elements ourselves with repmat. I would rewrite this function so that it looks like this:

function [X, mean_X, std_X] = standardize(varargin)
switch nargin
    case 1
        mean_X = mean(varargin{1}); %// Find mean of each column
        std_X = std(varargin{1}); %// Find std. dev. of each column

        X = bsxfun(@minus, varargin{1}, mean_X); %// Subtract each column by its respective mean
        X = bsxfun(@rdivide, X, std_X); %// Take each column and divide by its respective std dev.

    case 3
        mean_X = varargin{2};
        std_X = varargin{3};

        %// Same code as above
        X = bsxfun(@minus, varargin{1}, mean_X);
        X = bsxfun(@rdivide, X, std_X);
end

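I would argue that the new code above is faster than using for and repmat, because bsxfun avoids building the M x N copy of the mean vector and replaces the loop with a single vectorized call. The difference tends to matter most for larger matrices, though it is worth timing on your own MATLAB version.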
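If you want to check this on your own machine, a rough sketch along the following lines compares the two approaches on random data. The matrix size and all variable names are arbitrary assumptions, and tic/toc gives only a coarse measurement, so treat the timings as indicative. The last variant relies on implicit expansion, which is available from MATLAB R2016b onwards and lets you drop bsxfun entirely.

```matlab
% Hypothetical random data; the size is arbitrary and chosen for illustration.
A = rand(2000, 500);
mu = mean(A);
sigma = std(A);

% Loop + repmat version, as in the original function.
tic;
Z_loop = A - repmat(mu, [size(A, 1) 1]);
for i = 1:size(Z_loop, 2)
    Z_loop(:, i) = Z_loop(:, i) / sigma(i);
end
t_loop = toc;

% bsxfun version, as in the rewrite.
tic;
Z_bsx = bsxfun(@rdivide, bsxfun(@minus, A, mu), sigma);
t_bsx = toc;

% Implicit expansion (MATLAB R2016b or later): no bsxfun call needed.
Z_imp = (A - mu) ./ sigma;

fprintf('loop + repmat: %.4f s, bsxfun: %.4f s\n', t_loop, t_bsx);
fprintf('max difference, loop vs bsxfun: %g\n', max(abs(Z_loop(:) - Z_bsx(:))));
fprintf('max difference, bsxfun vs implicit: %g\n', max(abs(Z_bsx(:) - Z_imp(:))));
```

All three variants compute the same standardized matrix; only how the row vectors are expanded differs.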
