What is an intuitive explanation of the Expectation Maximization technique?

Problem Description

Expectation Maximization (EM) is a kind of probabilistic method to classify data. Please correct me if I am wrong and it is not in fact a classifier.

What is an intuitive explanation of this EM technique? What is expectation here and what is being maximized?

Recommended Answer

Suppose we have some data sampled from two different groups, red and blue:

[image: red and blue data points plotted along a number line]

Here, we can see which data point belongs to the red or blue group. This makes it easy to find the parameters that characterise each group. For example, the mean of the red group is around 3, the mean of the blue group is around 7 (and we could find the exact means if we wanted).

This is, generally speaking, known as maximum likelihood estimation. Given some data, we compute the value of a parameter (or parameters) that best explains that data.
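For a normal distribution the maximum likelihood estimates have a simple closed form: the sample mean and the sample standard deviation. As a minimal sketch (the sample values here are made up purely for illustration):

import numpy as np

samples = np.array([2.1, 3.4, 2.8, 3.0])  # hypothetical values from one group
mu_hat = np.mean(samples)    # maximum likelihood estimate of the mean
sigma_hat = np.std(samples)  # maximum likelihood estimate of the standard deviation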

Now imagine that we cannot see which value was sampled from which group. Everything looks purple to us:

[image: the same data points, all coloured purple]

Here we have the knowledge that there are two groups of values, but we don't know which group any particular value belongs to.

Can we still estimate the means for the red group and blue group that best fit this data?

Yes, often we can! Expectation Maximisation gives us a way to do it. The very general idea behind the algorithm is this:

  1. Start with an initial estimate of what each parameter might be.
  2. Compute the likelihood of each data point under each parameter estimate.
  3. Based on those likelihoods, calculate weights for each data point indicating whether it is more red or more blue. Combine the weights with the data (expectation).
  4. Compute a better estimate for the parameters using the weight-adjusted data (maximisation).
  5. Repeat steps 2 to 4 until the parameter estimates converge (the process stops producing different estimates). See the sketch just below this list.
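In code form, the loop looks something like this. This is a sketch only: initial_guess, converged, compute_weights and reestimate are hypothetical placeholder names, each fleshed out with real code in the walkthrough below.

params = initial_guess()                      # step 1
while not converged(params):                  # step 5
    weights = compute_weights(data, params)   # steps 2-3: expectation
    params = reestimate(data, weights)        # step 4: maximisation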

These steps need some further explanation, so I'll walk through the problem described above.

I'll use Python in this example, but the code should be fairly easy to understand if you're not familiar with this language.

Suppose we have two groups, red and blue, with the values distributed as in the image above. Specifically, each group contains values drawn from a normal distribution with the following parameters:

import numpy as np
from scipy import stats

np.random.seed(110) # for reproducible results

# set parameters
red_mean = 3
red_std = 0.8

blue_mean = 7
blue_std = 2

# draw 20 samples from normal distributions with red/blue parameters
red = np.random.normal(red_mean, red_std, size=20)
blue = np.random.normal(blue_mean, blue_std, size=20)

both_colours = np.sort(np.concatenate((red, blue))) # for later use...

Here is an image of these red and blue groups again (to save you from having to scroll up):

[image: the red and blue data points again]

When we can see the colour of each point (i.e. which group it belongs to), it's very easy to estimate the mean and standard deviation for each group. We just pass the red and blue values to the built-in functions in NumPy. For example:

>>> np.mean(red)
2.802
>>> np.std(red)
0.871
>>> np.mean(blue)
6.932
>>> np.std(blue)
2.195

But what if we can't see the colours of the points? That is, instead of red or blue, every point has been coloured purple.

To try and recover the mean and standard deviation parameters for the red and blue groups, we can use Expectation Maximisation.

Our first step (step 1 above) is to guess at the parameter values for each group's mean and standard deviation. We don't have to guess intelligently; we can pick any numbers we like:

# estimates for the mean
red_mean_guess = 1.1
blue_mean_guess = 9

# estimates for the standard deviation
red_std_guess = 2
blue_std_guess = 1.7

These parameter estimates produce bell curves that look like this:

[image: bell curves for the initial red and blue guesses, with the guessed means as vertical dotted lines]

These are bad estimates. Both means (the vertical dotted lines) look far off any kind of "middle" for sensible groups of points, for instance. We want to improve these estimates.

The next step (step 2) is to compute the likelihood of each data point appearing under the current parameter guesses:

likelihood_of_red = stats.norm(red_mean_guess, red_std_guess).pdf(both_colours)
likelihood_of_blue = stats.norm(blue_mean_guess, blue_std_guess).pdf(both_colours)

Here, we have simply put each data point into the probability density function for a normal distribution using our current guesses at the mean and standard deviation for red and blue. This tells us, for example, that with our current guesses the data point at 1.761 is much more likely to be red (0.189) than blue (0.00003).
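To make that concrete, here is a quick check of those two densities (the point 1.761 and the values quoted come from the current guesses above):

stats.norm(red_mean_guess, red_std_guess).pdf(1.761)    # ~0.189
stats.norm(blue_mean_guess, blue_std_guess).pdf(1.761)  # ~0.00003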

For each data point, we can turn these two likelihood values into weights (step 3) so that they sum to 1 as follows:

likelihood_total = likelihood_of_red + likelihood_of_blue

red_weight = likelihood_of_red / likelihood_total
blue_weight = likelihood_of_blue / likelihood_total
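
These weights behave like per-point probabilities of group membership, so a quick sanity check is that the red and blue weights sum to 1 for every data point:

# each data point's red and blue weights should sum to 1
assert np.allclose(red_weight + blue_weight, 1.0)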

With our current estimates and our newly-computed weights, we can now compute new estimates for the mean and standard deviation of the red and blue groups (step 4).

We compute the mean and standard deviation twice, using all data points but with different weightings: once with the red weights and once with the blue weights.

The key bit of intuition is that the greater the weight of a colour on a data point, the more the data point influences the next estimates for that colour's parameters. This has the effect of "pulling" the parameters in the right direction.

def estimate_mean(data, weight):
    """
    For each data point, multiply the point by the probability it
    was drawn from the colour's distribution (its "weight").

    Divide by the total weight: essentially, we're finding where 
    the weight is centred among our data points.
    """
    return np.sum(data * weight) / np.sum(weight)

def estimate_std(data, weight, mean):
    """
    For each data point, multiply the point's squared difference
    from a mean value by the probability it was drawn from
    that distribution (its "weight").

    Divide by the total weight: essentially, we're finding where 
    the weight is centred among the values for the difference of
    each data point from the mean.

    This is the estimate of the variance; take the positive square
    root to find the standard deviation.
    """
    variance = np.sum(weight * (data - mean)**2) / np.sum(weight)
    return np.sqrt(variance)

# new estimates for standard deviation
blue_std_guess = estimate_std(both_colours, blue_weight, blue_mean_guess)
red_std_guess = estimate_std(both_colours, red_weight, red_mean_guess)

# new estimates for mean
red_mean_guess = estimate_mean(both_colours, red_weight)
blue_mean_guess = estimate_mean(both_colours, blue_weight)

We have new estimates for the parameters. To improve them again, we can jump back to step 2 and repeat the process. We do this until the estimates converge, or after some number of iterations have been performed (step 5).
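Putting the pieces together, the full loop might look like this (a sketch that reuses the guesses, both_colours and the estimate_* functions defined above, and runs the 20 iterations used below):

for _ in range(20):
    # step 2: likelihood of each point under the current guesses
    likelihood_of_red = stats.norm(red_mean_guess, red_std_guess).pdf(both_colours)
    likelihood_of_blue = stats.norm(blue_mean_guess, blue_std_guess).pdf(both_colours)

    # step 3: normalise the likelihoods into per-point weights
    likelihood_total = likelihood_of_red + likelihood_of_blue
    red_weight = likelihood_of_red / likelihood_total
    blue_weight = likelihood_of_blue / likelihood_total

    # step 4: re-estimate the parameters from the weighted data
    red_std_guess = estimate_std(both_colours, red_weight, red_mean_guess)
    blue_std_guess = estimate_std(both_colours, blue_weight, blue_mean_guess)
    red_mean_guess = estimate_mean(both_colours, red_weight)
    blue_mean_guess = estimate_mean(both_colours, blue_weight)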

For our data, the first five iterations of this process look like this (more recent iterations are drawn more strongly):

[image: bell curves over the first five iterations]

We see that the means are already converging on some values, and the shapes of the curves (governed by the standard deviation) are also becoming more stable.

If we continue for 20 iterations, we end up with the following:

[image: the fitted bell curves after 20 iterations]

The EM process has converged to the following values, which turn out to be very close to the actual values (computed where we can see the colours, i.e. with no hidden variables):

          | EM guess | Actual |  Delta
----------+----------+--------+-------
Red mean  |    2.910 |  2.802 |  0.108
Red std   |    0.854 |  0.871 | -0.017
Blue mean |    6.838 |  6.932 | -0.094
Blue std  |    2.227 |  2.195 |  0.032

In the code above you may have noticed that the new estimate for the standard deviation was computed using the previous iteration's estimate of the mean. Ultimately it does not matter whether we compute a new value for the mean first, as we are just finding the (weighted) variance of the values around some central point. We will still see the parameter estimates converge.
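If you prefer the other ordering, here is a sketch of the variant that updates the means first and then uses those new means when re-estimating the standard deviations:

# variant: update the means first, then use the new means for the stds
red_mean_guess = estimate_mean(both_colours, red_weight)
blue_mean_guess = estimate_mean(both_colours, blue_weight)
red_std_guess = estimate_std(both_colours, red_weight, red_mean_guess)
blue_std_guess = estimate_std(both_colours, blue_weight, blue_mean_guess)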
