How does this code for standardizing data work?

Question

I have a provided standardize function for a machine learning course that wasn't well documented and I'm still new to MATLAB so I'm just trying to break down the function. Any explanation of the syntax or the general idea of standardizing would greatly help. We use this function to standardize a set of training data provided in a large matrix. A break down of most of the lines of the code snippet would help me greatly. Thank you so much.

function [X, mean_X, std_X] = standardize(varargin)
switch nargin
    case 1
        mean_X = mean(varargin{1});
        std_X = std(varargin{1});

        X = varargin{1} - repmat(mean_X, [size(varargin{1}, 1) 1]);

        for i = 1:size(X, 2)
            X(:, i) =  X(:, i) / std(X(:, i));
        end     
    case 3
        mean_X = varargin{2};
        std_X = varargin{3};
        X = varargin{1} - repmat(mean_X, [size(varargin{1}, 1) 1]);
        for i = 1:size(X, 2)
            X(:, i) =  X(:, i) / std_X(:, i);
        end 
end

Recommended answer

This code accepts a data matrix of size M x N and works column by column: each of the N columns is standardized independently, using the M values stored in that column. In the usual machine learning layout, each row is one data sample and each column is one feature, so M is the number of samples and N is the number of features.

Now, the true purpose of this code is to take all of the columns of your matrix and standardize / normalize the data so that each column exhibits zero mean and unit variance. This means that after this transform, if you computed the mean of any column in this matrix, it would be 0 and the variance would be 1 (up to floating-point rounding). This is a very standard method for normalizing values in statistical analysis, machine learning, and computer vision.
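As a quick check of the zero-mean, unit-variance property, here is a minimal sketch on a single column. The numbers in x are made up for illustration and are not from the course data.

```matlab
% Made-up column of values (an assumption for illustration only)
x = [2; 4; 4; 4; 5; 5; 7; 9];

% Subtract the mean, then divide by the standard deviation
z = (x - mean(x)) / std(x);

m = mean(z);   % should be 0 up to floating-point rounding
v = var(z);    % should be 1 up to floating-point rounding
```

Note that std and var both use the N-1 normalization by default, which is why the variance of z comes out as 1 here.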

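This actually comes from the z-score in statistical analysis. Specifically, the equation for normalization is:

$$z = \frac{x - \mu}{\sigma}$$

Here x is a data value, \mu is the mean, and \sigma is the standard deviation.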

Given a set of data points, we subtract the mean of these data points from the value in question, then divide by the respective standard deviation. Here is how you would call this code. Given this matrix, which we will call X, there are two ways to call it:

  • Method #1: [X, mean_X, std_X] = standardize(X);
  • Method #2: [X, mean_X, std_X] = standardize(X, mu, sigma);

The first method automatically computes the mean and the standard deviation of each column of X. mean_X and std_X are both returned as 1 x N vectors that give the mean and standard deviation of each column in the matrix X. The second method allows you to manually specify a mean (mu) and standard deviation (sigma) for each column of X, both as 1 x N vectors. This may be for debugging, but a more common reason in machine learning is to apply the statistics computed on the training data to new data such as a test set. In this case, mean_X and std_X are simply mu and sigma returned unchanged.
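One plausible reason to pass mu and sigma explicitly is to reuse training statistics on other data. The sketch below assumes that use case. apply_stats is a hypothetical stand-in for the three-input call, not the course function itself, and both matrices are made up.

```matlab
% Hypothetical stand-in for standardize(X, mu, sigma); not the course function
apply_stats = @(A, mu, sigma) bsxfun(@rdivide, bsxfun(@minus, A, mu), sigma);

X_train = [1 10; 2 20; 3 30];   % made-up training data, 3 rows x 2 columns
X_test  = [4 40; 0  0];         % made-up test data with the same 2 columns

mu    = mean(X_train);          % 1 x 2 row vector of column means
sigma = std(X_train);           % 1 x 2 row vector of column std. devs.

Z_train = apply_stats(X_train, mu, sigma);
Z_test  = apply_stats(X_test,  mu, sigma);   % reuses the training statistics
```

With the real function, the equivalent calls would be [Z_train, mu, sigma] = standardize(X_train) followed by Z_test = standardize(X_test, mu, sigma).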

The code is a bit poorly written IMHO, because you can certainly achieve this in a vectorized way, but the gist is as follows. If we are using Method #1, it finds the mean of every column of the matrix X, duplicates this 1 x N vector vertically so that it becomes an M x N matrix, then subtracts this matrix from X. This subtracts from each column its respective mean. We also compute the standard deviation of each column before the mean subtraction.
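To make the repmat step concrete, here is a small sketch on a made-up 3 x 2 matrix. The variable names A, mean_rep and centered are only for illustration.

```matlab
A = [1 10; 2 20; 3 30];            % made-up data, M = 3 rows, N = 2 columns

mean_X = mean(A);                  % 1 x 2 row vector of column means
M = size(A, 1);                    % number of rows

% Stack M copies of the row vector vertically to get an M x N matrix
mean_rep = repmat(mean_X, [M 1]);

% Element-wise subtraction now removes each column's own mean
centered = A - mean_rep;
```

Here mean_rep has the same size as A, so the subtraction is an ordinary element-wise operation.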

Once we do that, we then normalize our X by dividing each column by its respective standard deviation. BTW, doing std_X(:, i) is superfluous as std_X is already a 1 x N vector. std_X(:, i) means to grab all of the rows at the ith column. If we already have a 1 x N vector, this can simply be replaced with std_X(i) - a bit overkill for my taste.
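The indexing point can be checked directly. In this small sketch the values in std_X are made up.

```matlab
std_X = [1 10 100];    % made-up 1 x 3 row vector
i = 2;

a = std_X(:, i);       % all rows of column i; there is only one row
b = std_X(i);          % linear indexing picks the same element

same = isequal(a, b);  % true for any row vector
```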

Method #2 performs the same thing as Method #1, but we provide our own mean and standard deviation for each column of X.

For the sake of documentation, this is how I would have commented the code:

function [X, mean_X, std_X] = standardize(varargin)
switch nargin %// Check how many input variables we have input into the function
    case 1 %// If only one variable - this is the input matrix
        mean_X = mean(varargin{1}); %// Find mean of each column
        std_X = std(varargin{1}); %// Find standard deviation of each column

        %// Take each column of X and subtract by its corresponding mean
        %// Take mean_X and duplicate M times vertically
        X = varargin{1} - repmat(mean_X, [size(varargin{1}, 1) 1]);

        %// Next, for each column, normalize by its respective standard deviation
        for i = 1:size(X, 2)
            X(:, i) =  X(:, i) / std(X(:, i));
        end     
    case 3 %// If we provide three inputs
        mean_X = varargin{2}; %// Second input is a mean vector
        std_X = varargin{3}; %// Third input is a standard deviation vector

        %// Apply the code as seen in the first case
        X = varargin{1} - repmat(mean_X, [size(varargin{1}, 1) 1]);
        for i = 1:size(X, 2)
            X(:, i) =  X(:, i) / std_X(:, i);
        end 
end


If I can suggest another way to write this code, I would use the mighty and powerful bsxfun function. This avoids having to duplicate any elements explicitly, because the expansion of the row vector across the rows is handled under the hood. I would rewrite this function so that it looks like this:

function [X, mean_X, std_X] = standardize(varargin)
switch nargin
    case 1
        mean_X = mean(varargin{1}); %// Find mean of each column
        std_X = std(varargin{1}); %// Find std. dev. of each column

        X = bsxfun(@minus, varargin{1}, mean_X); %// Subtract each column by its respective mean
        X = bsxfun(@rdivide, X, std_X); %// Take each column and divide by its respective std dev.

    case 3
        mean_X = varargin{2};
        std_X = varargin{3};

        %// Same code as above
        X = bsxfun(@minus, varargin{1}, mean_X);
        X = bsxfun(@rdivide, X, std_X);
end

I would argue that the new code above is faster than using for and repmat. bsxfun avoids building the replicated M x N matrix and the explicit loop, and it is generally reported to outperform the repmat approach, especially for larger matrices.
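As a sanity check that the rewrite does not change the result, the sketch below compares the loop-plus-repmat version with the bsxfun version on a made-up matrix. It also includes a third variant that relies on implicit expansion, which assumes MATLAB R2016b or later (or a recent Octave). That variant is an addition for comparison and is not part of the code above.

```matlab
A = [1 10; 2 20; 3 30; 4 50];      % made-up data
mu = mean(A);
sigma = std(A);

% Loop plus repmat, following the structure of the original function
Z_loop = A - repmat(mu, [size(A, 1) 1]);
for i = 1:size(Z_loop, 2)
    Z_loop(:, i) = Z_loop(:, i) / sigma(i);
end

% bsxfun version
Z_bsx = bsxfun(@rdivide, bsxfun(@minus, A, mu), sigma);

% Implicit expansion (assumes R2016b or later)
Z_imp = (A - mu) ./ sigma;

% Largest absolute difference between the three results
max_diff = max(abs([Z_loop(:) - Z_bsx(:); Z_bsx(:) - Z_imp(:)]));
```

All three should agree up to floating-point rounding. Timing differences depend on the matrix size and MATLAB version, so it is worth measuring with tic and toc on your own data.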
