Cost function in logistic regression gives NaN as a result


Problem description


I am implementing logistic regression using batch gradient descent. There are two classes into which the input samples are to be classified. The classes are 1 and 0. While training the data, I am using the following sigmoid function:

t = 1 ./ (1 + exp(-z));

where

z = x*theta


And I am using the following cost function to calculate cost, to determine when to stop training.

function cost = computeCost(x, y, theta)
    htheta = sigmoid(x*theta);
    cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
end


I am getting the cost at each step to be NaN as the values of htheta are either 1 or zero in most cases. What should I do to determine the cost value at each iteration?


This is the gradient descent code for logistic regression:

function [theta,cost_history] = batchGD(x,y,theta,alpha)

cost_history = zeros(1000,1);

for iter=1:1000
  htheta = sigmoid(x*theta);
  new_theta = zeros(size(theta,1),1);
  % Update every parameter using the full batch of training examples
  for feature=1:size(theta,1)
    new_theta(feature) = theta(feature) - alpha * sum((htheta - y) .* x(:,feature));
  end
  theta = new_theta;
  cost_history(iter) = computeCost(x,y,theta);
end
end
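
As a side note, the per-feature inner loop above can be collapsed into a single vectorized update. This is only a minimal, equivalent sketch, not part of the original code:

% Equivalent vectorized batch update: x' * (htheta - y) computes
% sum((htheta - y) .* x(:,k)) for every feature k at once.
htheta = sigmoid(x*theta);
theta  = theta - alpha * (x' * (htheta - y));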

Recommended answer


There are two possible reasons why this may be happening to you.


The first is that when you apply the sigmoid / logit function to your hypothesis, the output probabilities are almost all approximately 0 or 1, and with your cost function, log(1 - 1) or log(0) will produce -Inf. The accumulation of all of these individual terms in your cost function eventually leads to NaN.


Specifically, if y = 0 for a training example and the hypothesis output has saturated to exactly 0 in double precision, the first part of the cost function gives 0*log(0); since log(0) is -Inf and 0*(-Inf) is NaN, that term is NaN. Similarly, if y = 1 for a training example and the hypothesis output has saturated to exactly 1, the second part gives 0*log(1 - 1) = 0*log(0) and again produces NaN. Simply put, the output of your hypothesis is either very close to 0 or very close to 1.
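
To make the mechanism concrete, here is a minimal MATLAB/Octave illustration (not from the original post; z = 50 is just an illustrative value of a large weighted sum):

z = 50;                               % a large positive value of x*theta
htheta = 1 ./ (1 + exp(-z));          % sigmoid saturates to exactly 1 in double precision
term = -(1 - 1) .* log(1 - htheta);   % 0 * log(0) = 0 * (-Inf) = NaN
disp(term)                            % prints NaN, which poisons the whole sum in the cost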


This is most likely happening because the dynamic range of each feature is widely different, so part of your hypothesis, specifically the weighted sum x*theta for each training example, takes on very large negative or positive values, and applying the sigmoid function to these values pushes the output very close to 0 or 1.


One way to combat this is to normalize the data in your matrix before performing training using gradient descent. A typical approach is to normalize with zero mean and unit variance. Given an input feature x_k, where k = 1, 2, ..., n and you have n features, the new normalized feature x_k^{new} can be found by:

x_k^{new} = (x_k - m_k) / s_k


m_k is the mean of feature k and s_k is the standard deviation of feature k. This is also known as standardizing the data. You can read up on more details about this in another answer I gave here: How does this code for standardizing data work?


Because you are using the linear algebra approach to gradient descent, I'm assuming you have prepended your data matrix with a column of all ones. Knowing this, we can normalize your data like so:

mX = mean(x,1);       % per-feature means
mX(1) = 0;            % leave the all-ones bias column unshifted
sX = std(x,[],1);     % per-feature standard deviations
sX(1) = 1;            % leave the bias column unscaled
xnew = bsxfun(@rdivide, bsxfun(@minus, x, mX), sX);


The mean and standard deviations of each feature are stored in mX and sX respectively. You can learn how this code works by reading the post I linked to you above. I won't repeat that stuff here because that isn't the scope of this post. To ensure proper normalization, I've made the mean and standard deviation of the first column to be 0 and 1 respectively. xnew contains the new normalized data matrix. Use xnew with your gradient descent algorithm instead. Now once you find the parameters, to perform any predictions you must normalize any new test instances with the mean and standard deviation from the training set. Because the parameters learned are with respect to the statistics of the training set, you must also apply the same transformations to any test data you want to submit to the prediction model.


Assuming you have new data points stored in a matrix called xx, you would normalize them first and then perform the predictions:

xxnew = bsxfun(@rdivide, bsxfun(@minus, xx, mX), sX);


Now that you have this, you can perform your predictions:

pred = sigmoid(xxnew*theta) >= 0.5;


You can change the threshold of 0.5 to whatever you believe best determines whether examples belong to the positive or negative class.


The second possible reason is the learning rate. As you mentioned in the comments, once you normalize the data the costs appear to be finite but then suddenly go to NaN after a few iterations. Normalization can only get you so far. If your learning rate alpha is too large, each iteration will overshoot in the direction of the minimum, making the cost at each iteration oscillate or even diverge, which appears to be what is happening here. In your case, the cost is diverging or increasing at each iteration to the point where it is so large that it can no longer be represented in floating-point precision.


As such, one other option is to decrease your learning rate alpha until you see that the cost function is decreasing at each iteration. A popular way to determine the best learning rate is to run gradient descent over a range of logarithmically spaced values of alpha, look at the final cost function value for each, and choose the learning rate that produced the smallest cost.
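
A rough sketch of that search, assuming the batchGD function above, the normalized matrix xnew, and a zero initial theta (the variable names alphas, finalCosts and bestAlpha are just illustrative):

alphas = logspace(-5, 0, 6);            % candidate learning rates: 1e-5 ... 1
finalCosts = zeros(size(alphas));
for i = 1:numel(alphas)
    theta0 = zeros(size(xnew,2), 1);    % start from the same initial parameters each time
    [~, hist] = batchGD(xnew, y, theta0, alphas(i));
    finalCosts(i) = hist(end);          % cost after the last iteration
end
[~, best] = min(finalCosts);            % pick the rate with the smallest final cost
bestAlpha = alphas(best);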


Using the two facts above together should allow gradient descent to converge quite nicely, assuming that the cost function is convex. In this case for logistic regression, it most certainly is.

