了解高斯混合模型的概念 [英] Understanding concept of Gaussian Mixture Models

查看:88
本文介绍了了解高斯混合模型的概念的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试通过阅读在线资源来了解GMM.我已经使用K-Means实现了聚类,并且正在观察GMM与K-means的比较.

这是我所了解的,如果我的概念有误,请告诉我:

在两种情况下都可以实现聚类的意义上,

GMM类似于KNN.但是在GMM中,每个聚类都有自己独立的均值和协方差.此外,k均值对群集进行数据点的硬分配,而在GMM中,我们得到了独立的高斯分布的集合,对于每个数据点,我们都有属于该分布之一的可能性.

为了更好地理解它,我使用MatLab对其进行编码并实现所需的聚类.我已经使用SIFT功能来进行特征提取.并且已经使用k-means聚类来初始化值. (来自 VLFeat 文档)

%images is a 459 x 1 cell array where each cell contains the training image
[locations, all_feats] = vl_dsift(single(images{1}), 'fast', 'step', 50); %all_feats will be 128 x no. of keypoints detected
for i=2:(size(images,1))
    [locations, feats] = vl_dsift(single(images{i}), 'fast', 'step', 50);
    all_feats = cat(2, all_feats, feats); %cat column wise all features
end

numClusters = 50; %Just a random selection.
% Run KMeans to pre-cluster the data
[initMeans, assignments] = vl_kmeans(single(all_feats), numClusters, ...
    'Algorithm','Lloyd', ...
    'MaxNumIterations',5);

initMeans = double(initMeans); %GMM needs it to be double

% Find the initial means, covariances and priors
for i=1:numClusters
    data_k = all_feats(:,assignments==i);
    initPriors(i) = size(data_k,2) / numClusters;

    if size(data_k,1) == 0 || size(data_k,2) == 0
        initCovariances(:,i) = diag(cov(data'));
    else
        initCovariances(:,i) = double(diag(cov(double((data_k')))));
    end
end

% Run EM starting from the given parameters
[means,covariances,priors,ll,posteriors] = vl_gmm(double(all_feats), numClusters, ...
    'initialization','custom', ...
    'InitMeans',initMeans, ...
    'InitCovariances',initCovariances, ...
    'InitPriors',initPriors);

基于上述内容,我有meanscovariancespriors.我的主要问题是,现在呢?我现在有点迷路了.

meanscovariances向量的大小也分别为128 x 50.我期望它们是1 x 50,因为每一列都是一个簇,每个簇是否只有一个均值和协方差? (我知道128个是SIFT功能,但我期望均值和协方差.)

在k均值中,我使用了MatLab命令knnsearch(X,Y),该命令基本上是针对Y中的每个点在X中找到最近的邻居.

因此,如何在GMM中实现这一目标,我知道它是一系列概率的集合,当然,与该概率最接近的匹配项将是我们的获胜集群.这就是我感到困惑的地方. 在线上的所有教程都讲授了如何实现meanscovariances值,但是在聚类方面并没有过多地说明如何实际使用它们.

谢谢

解决方案

我认为,如果您先看看 GMM 模型代表.我将使用正态分布的混合情况开始.每个高斯由一对平均值 EM聚类的想法是,每个分布都代表一个聚类.因此,在上面的一维数据示例中,如果给定实例x = 0.5,我们将其分配为属于第一个具有99.5%概率的集群/模式

>> x = 0.5;
>> posterior(gmm, x)
ans =
    0.9950    0.0050    % probability x came from each component

您可以看到实例在第一个钟形曲线下的下落情况.而如果您在中间取一个点,答案将会更加模棱两可(分配给class = 2的点,但是不确定性要大得多):

>> x = 2.2
>> posterior(gmm, 2.2)
ans =
    0.4717    0.5283


相同的概念通​​过多元正态分布扩展到更高的维度.在多个维度上,协方差矩阵是方差的泛化,目的是考虑要素之间的相互依赖性.

这又是一个示例,其中二维混合了两个MVN分布:

% first distribution is centered at (0,0), second at (-1,3)
mu = [0 0; 3 3];

% covariance of first is identity matrix, second diagonal
sigma = cat(3, eye(2), [5 0; 0 1]);

% again I'm using equal priors
p = [0.5; 0.5];

% build GMM
gmm = gmdistribution(mu, sigma, p);

% 2D projection
ezcontourf(@(x,y) pdf(gmm,[x y]));

% view PDF surface
ezsurfc(@(x,y) pdf(gmm,[x y]));

协方差矩阵如何影响关节密度函数的形状背后有一些直觉.例如在2D中,如果矩阵是对角线,则表示这两个维度不会同时变化.在那种情况下,PDF看起来像是一个轴对齐的椭圆,它根据水平方向上的差异较大而水平或垂直延伸.如果它们相等,则形状是一个完美的圆(分布在两个维度上的分布速率相等).最后,如果协方差矩阵是任意的(按定义非对角但仍然对称),则它看起来可能像是一个以一定角度旋转的拉伸椭圆.

因此,在上图中,您应该能够区分两个凸点"以及每个代表的个体分布.当您使用3D或更高尺寸时,请以3D尺寸代表(hyper-)椭球. >


现在,当您使用GMM执行集群时,目标是找到模型参数(每个分布以及先验分布),以使生成的模型最适合数据.最适合的估算值可转化为最大化给定GMM模型的数据的可能性(这意味着您选择的模型会最大化).

正如其他人所解释的那样,这可以通过 EM算法迭代地解决. EM从对混合物模型参数的初始估计或猜测开始.根据参数产生的混合密度,迭代地对数据实例进行重新评分.然后将重新计分的实例用于更新参数估计.重复此过程,直到算法收敛为止.

不幸的是,EM算法对模型的初始化非常敏感,因此,如果设置的初始值很低,甚至陷入 K均值作为第一步(如您所示)在您的代码中),然后使用这些聚类的均值/均值来初始化EM.

与其他集群分析技术一样,我们首先需要确定要使用的集群数量. 交叉验证是一种可靠的方法,可以很好地估计群集数量. >

EM群集受以下事实困扰:需要拟合很多参数,并且通常需要大量数据和多次迭代才能获得良好的结果.具有M混合物和D维数据的无约束模型涉及拟合D*D*M + D*M + M参数(M个协方差矩阵,每个D大小为DxD,加上M个平均长度为D的向量,以及一个长度为M的先验向量).对于大量维的数据集,这可能是个问题.因此,通常会施加限制和假设以简化问题(一种正则化,以避免过度拟合问题).例如,您可以将协方差矩阵固定为仅对角线,甚至可以在所有高斯人之间共享 a 的协方差矩阵.

最后,一旦拟合了混合模型,就可以通过使用每个混合分量计算数据实例的后验概率来探索聚类(就像我在1D示例中展示的那样). GMM根据这种成员资格"可能性将每个实例分配给集群.


这是使用高斯混合模型进行聚类数据的更完整示例:

% load Fisher Iris dataset
load fisheriris

% project it down to 2 dimensions for the sake of visualization
[~,data] = pca(meas,'NumComponents',2);
mn = min(data); mx = max(data);
D = size(data,2);    % data dimension    

% inital kmeans step used to initialize EM
K = 3;               % number of mixtures/clusters
cInd = kmeans(data, K, 'EmptyAction','singleton');

% fit a GMM model
gmm = fitgmdist(data, K, 'Options',statset('MaxIter',1000), ...
    'CovType','full', 'SharedCov',false, 'Regularize',0.01, 'Start',cInd);

% means, covariances, and mixing-weights
mu = gmm.mu;
sigma = gmm.Sigma;
p = gmm.PComponents;

% cluster and posterior probablity of each instance
% note that: [~,clustIdx] = max(p,[],2)
[clustInd,~,p] = cluster(gmm, data);
tabulate(clustInd)

% plot data, clustering of the entire domain, and the GMM contours
clrLite = [1 0.6 0.6 ; 0.6 1 0.6 ; 0.6 0.6 1];
clrDark = [0.7 0 0 ; 0 0.7 0 ; 0 0 0.7];
[X,Y] = meshgrid(linspace(mn(1),mx(1),50), linspace(mn(2),mx(2),50));
C = cluster(gmm, [X(:) Y(:)]);
image(X(:), Y(:), reshape(C,size(X))), hold on
gscatter(data(:,1), data(:,2), species, clrDark)
h = ezcontour(@(x,y)pdf(gmm,[x y]), [mn(1) mx(1) mn(2) mx(2)]);
set(h, 'LineColor','k', 'LineStyle',':')
hold off, axis xy, colormap(clrLite)
title('2D data and fitted GMM'), xlabel('PC1'), ylabel('PC2')

I'm trying to understand GMM by reading the sources available online. I have achieved clustering using K-Means and was seeing how GMM would compare to K-means.

Here is what I have understood, please let me know if my concept is wrong:

GMM is like KNN, in the sense that clustering is achieved in both cases. But in GMM each cluster has their own independent mean and covariance. Furthermore k-means performs hard assignments of data points to clusters whereas in GMM we get a collection of independant gaussian distributions, and for each data point we have a probability that it belongs to one of the distributions.

To understand it better I have used MatLab to code it and achieve the desired clustering. I have used SIFT features for the purpose of feature extraction. And have used k-means clustering to initialize the values. (This is from the VLFeat documentation)

%images is a 459 x 1 cell array where each cell contains the training image
[locations, all_feats] = vl_dsift(single(images{1}), 'fast', 'step', 50); %all_feats will be 128 x no. of keypoints detected
for i=2:(size(images,1))
    [locations, feats] = vl_dsift(single(images{i}), 'fast', 'step', 50);
    all_feats = cat(2, all_feats, feats); %cat column wise all features
end

numClusters = 50; %Just a random selection.
% Run KMeans to pre-cluster the data
[initMeans, assignments] = vl_kmeans(single(all_feats), numClusters, ...
    'Algorithm','Lloyd', ...
    'MaxNumIterations',5);

initMeans = double(initMeans); %GMM needs it to be double

% Find the initial means, covariances and priors
for i=1:numClusters
    data_k = all_feats(:,assignments==i);
    initPriors(i) = size(data_k,2) / numClusters;

    if size(data_k,1) == 0 || size(data_k,2) == 0
        initCovariances(:,i) = diag(cov(data'));
    else
        initCovariances(:,i) = double(diag(cov(double((data_k')))));
    end
end

% Run EM starting from the given parameters
[means,covariances,priors,ll,posteriors] = vl_gmm(double(all_feats), numClusters, ...
    'initialization','custom', ...
    'InitMeans',initMeans, ...
    'InitCovariances',initCovariances, ...
    'InitPriors',initPriors);

Based on the above I have means, covariances and priors. My main question is, What now? I am kind of lost now.

Also the means, covariances vectors are each of the size 128 x 50. I was expecting them to be 1 x 50 since each column is a cluster, wont each cluster have only one mean and covariance? (I know 128 are the SIFT features but I was expecting means and covariances).

In k-means I used the the MatLab command knnsearch(X,Y) which basically finds the nearest neighbour in X for each point in Y.

So how to achieve this in GMM, I know its a collection of probabilities, and ofcourse the nearest match from that probability will be our winning cluster. And this is where I am confused. All tutorials online have taught how to achieve the means, covariances values, but do not say much in how to actually use them in terms of clustering.

Thank you

解决方案

I think it would help if you first look at what a GMM model represents. I'll be using functions from the Statistics Toolbox, but you should be able to do the same using VLFeat.

Let's start with the case of a mixture of two 1-dimensional normal distributions. Each Gaussian is represented by a pair of mean and variance. The mixture assign a weight to each component (prior).

For example, lets mix two normal distributions with equal weights (p = [0.5; 0.5]), the first centered at 0 and the second at 5 (mu = [0; 5]), and the variances equal 1 and 2 respectively for the first and second distributions (sigma = cat(3, 1, 2)).

As you can see below, the mean effectively shifts the distribution, while the variance determines how wide/narrow and flat/pointy it is. The prior sets the mixing proportions to get the final combined model.

% create GMM
mu = [0; 5];
sigma = cat(3, 1, 2);
p = [0.5; 0.5];
gmm = gmdistribution(mu, sigma, p);

% view PDF
ezplot(@(x) pdf(gmm,x));

The idea of EM clustering is that each distribution represents a cluster. So in the example above with one dimensional data, if you were given an instance x = 0.5, we would assign it as belonging to the first cluster/mode with 99.5% probability

>> x = 0.5;
>> posterior(gmm, x)
ans =
    0.9950    0.0050    % probability x came from each component

you can see how the instance falls well under the first bell-curve. Whereas if you take a point in the middle, the answer would be more ambiguous (point assigned to class=2 but with much less certainty):

>> x = 2.2
>> posterior(gmm, 2.2)
ans =
    0.4717    0.5283


The same concepts extend to higher dimension with multivariate normal distributions. In more than one dimension, the covariance matrix is a generalization of variance, in order to account for inter-dependencies between features.

Here is an example again with a mixture of two MVN distributions in 2-dimensions:

% first distribution is centered at (0,0), second at (-1,3)
mu = [0 0; 3 3];

% covariance of first is identity matrix, second diagonal
sigma = cat(3, eye(2), [5 0; 0 1]);

% again I'm using equal priors
p = [0.5; 0.5];

% build GMM
gmm = gmdistribution(mu, sigma, p);

% 2D projection
ezcontourf(@(x,y) pdf(gmm,[x y]));

% view PDF surface
ezsurfc(@(x,y) pdf(gmm,[x y]));

There is some intuition behind how the the covariance matrix affects the shape of the joint density function. For instance in 2D, if the matrix is diagonal it implies that the two dimensions don't co-vary. In that case the PDF would look like an axis-aligned ellipse stretched out either horizontally or vertically according to which dimension has the bigger variance. If they are equal, then the shape is a perfect circle (distribution spread out in both dimensions at an equal rate). Finally if the covariance matrix is arbitrary (non-diagonal but still symmetric by definition), then it will probably look like a stretched ellipse rotated at some angle.

So in the previous figure, you should be able to tell the two "bumps" apart and what individual distribution each represent. When you go 3D and higher dimensions, think of the it as representing (hyper-)ellipsoids in N-dims.


Now when you're performing clustering using GMM, the goal is to find the model parameters (mean and covariance of each distribution as well as the priors) so that the resulting model best fits the data. The best-fit estimation translates into maximizing the likelihood of the data given the GMM model (meaning you choose model that maximizes Pr(data|model)).

As other have explained, this is solved iteratively using the EM algorithm; EM starts with an initial estimate or guess of the parameters of the mixture model. It iteratively re-scores the data instances against the mixture density produced by the parameters. The re-scored instances are then used to update the parameter estimates. This is repeated until the algorithm converges.

Unfortunately the EM algorithm is very sensitive to the initialization of the model, so it might take a long time to converge if you set poor initial values, or even get stuck in local optima. A better way to initial the GMM parameters is to use K-means as a first step (like you've shown in your code), and using the mean/cov of those clusters to initialize EM.

As with other cluster analysis techniques, we first need to decide on the number of clusters to use. Cross-validation is a robust way to find a good estimate of the number of clusters.

EM clustering suffers from the fact that there a lot parameters to fit, and usually requires lots of data and many iterations to get good results. An unconstrained model with M-mixtures and D-dimensional data involves fitting D*D*M + D*M + M parameters (M covariance matrices each of size DxD, plus M mean vectors of length D, plus a vector of priors of length M). That could be a problem for datasets with large number of dimensions. So it is customary to impose restrictions and assumption to simplify the problem (a sort of regularization to avoid overfitting problems). For instance you could fix the covariance matrix to be only diagonal or even have the covariance matrices shared across all Gaussians.

Finally once you've fitted the mixture model, you can explore the clusters by computing the posterior probability of data instances using each mixture component (like I've showed with the 1D example). GMM assigns each instance to a cluster according to this "membership" likelihood.


Here is a more complete example of clustering data using Gaussian mixture models:

% load Fisher Iris dataset
load fisheriris

% project it down to 2 dimensions for the sake of visualization
[~,data] = pca(meas,'NumComponents',2);
mn = min(data); mx = max(data);
D = size(data,2);    % data dimension    

% inital kmeans step used to initialize EM
K = 3;               % number of mixtures/clusters
cInd = kmeans(data, K, 'EmptyAction','singleton');

% fit a GMM model
gmm = fitgmdist(data, K, 'Options',statset('MaxIter',1000), ...
    'CovType','full', 'SharedCov',false, 'Regularize',0.01, 'Start',cInd);

% means, covariances, and mixing-weights
mu = gmm.mu;
sigma = gmm.Sigma;
p = gmm.PComponents;

% cluster and posterior probablity of each instance
% note that: [~,clustIdx] = max(p,[],2)
[clustInd,~,p] = cluster(gmm, data);
tabulate(clustInd)

% plot data, clustering of the entire domain, and the GMM contours
clrLite = [1 0.6 0.6 ; 0.6 1 0.6 ; 0.6 0.6 1];
clrDark = [0.7 0 0 ; 0 0.7 0 ; 0 0 0.7];
[X,Y] = meshgrid(linspace(mn(1),mx(1),50), linspace(mn(2),mx(2),50));
C = cluster(gmm, [X(:) Y(:)]);
image(X(:), Y(:), reshape(C,size(X))), hold on
gscatter(data(:,1), data(:,2), species, clrDark)
h = ezcontour(@(x,y)pdf(gmm,[x y]), [mn(1) mx(1) mn(2) mx(2)]);
set(h, 'LineColor','k', 'LineStyle',':')
hold off, axis xy, colormap(clrLite)
title('2D data and fitted GMM'), xlabel('PC1'), ylabel('PC2')

这篇关于了解高斯混合模型的概念的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆