Cost function in logistic regression gives NaN as a result

Problem Description

I am implementing logistic regression using batch gradient descent. The input samples are classified into two classes, 1 and 0. While training on the data, I am using the following sigmoid function:

t = 1 ./ (1 + exp(-z));

where

z = x*theta
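
For reference, the sigmoid helper called by the code below can be written as a small function wrapping the expression above (a minimal sketch; the exact wrapper is assumed):

function t = sigmoid(z)
    % element-wise logistic function; works on scalars, vectors, and matrices
    t = 1 ./ (1 + exp(-z));
end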

I am using the following cost function to calculate the cost, in order to determine when to stop training.

function cost = computeCost(x, y, theta)
    htheta = sigmoid(x*theta);
    cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
end

Since the values of htheta are 1 or 0 in most cases, the cost I get at every step is NaN. How do I determine the cost value at each iteration?

This is the gradient descent code for logistic regression:

function [theta,cost_history] = batchGD(x,y,theta,alpha)

cost_history = zeros(1000,1);

for iter=1:1000
  htheta = sigmoid(x*theta);
  new_theta = zeros(size(theta,1),1);
  for feature=1:size(theta,1)
    new_theta(feature) = theta(feature) - alpha * sum((htheta - y) .* x(:,feature));
  end
  theta = new_theta;
  cost_history(iter) = computeCost(x,y,theta);
end
end

Solution

There are two possible reasons why this may be happening to you.

The data is not normalized

This is because when you apply the sigmoid / logit function to your hypothesis, the output probabilities are almost all approximately 0s or all 1s and with your cost function, log(1 - 1) or log(0) will produce -Inf. The accumulation of all of these individual terms in your cost function will eventually lead to NaN.

Specifically, if y = 0 for a training example and the output of your hypothesis is a value x so close to 0 that it is stored as exactly 0, then the first part of the cost function becomes 0*log(x) = 0*(-Inf), which evaluates to NaN. Similarly, if y = 1 for a training example and the output of your hypothesis is so close to 1 that 1 - htheta is stored as exactly 0, the second part again becomes 0*log(0) and produces NaN. Simply put, the output of your hypothesis is either very close to 0 or very close to 1.
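
A quick check at the MATLAB/Octave prompt shows how this NaN arises (a small demonstration assuming the sigmoid helper from the question):

h = sigmoid(100)     % exp(-100) is far below eps, so 1 + exp(-100) rounds to 1 and h is exactly 1
log(1 - h)           % log(0) = -Inf
0 * log(1 - h)       % 0 * (-Inf) evaluates to NaN, which then poisons the sum in computeCost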

This is most likely because the dynamic range of each feature is widely different, so part of your hypothesis, specifically the weighted sum x*theta for each training example, will give you either very large negative or very large positive values. If you apply the sigmoid function to these values, you will get outputs very close to 0 or 1.

One way to combat this is to normalize the data in your matrix before performing training using gradient descent. A typical approach is to normalize with zero mean and unit variance. Given an input feature x_k, where k = 1, 2, ..., n and n is the number of features, the new normalized feature x_k^{new} can be found by:

x_k^{new} = (x_k - m_k) / s_k

m_k is the mean of feature k and s_k is the standard deviation of feature k. This is also known as standardizing the data. You can read up on more details about this in another answer I gave here: How does this code for standardizing data work?

Because you are using the linear algebra approach to gradient descent, I'm assuming you have prepended your data matrix with a column of all ones. Knowing this, we can normalize your data like so:

mX = mean(x,1); 
mX(1) = 0; 
sX = std(x,[],1); 
sX(1) = 1; 
xnew = bsxfun(@rdivide, bsxfun(@minus, x, mX), sX);

The mean and standard deviations of each feature are stored in mX and sX respectively. You can learn how this code works by reading the post I linked to you above. I won't repeat that stuff here because that isn't the scope of this post. To ensure proper normalization, I've made the mean and standard deviation of the first column to be 0 and 1 respectively. xnew contains the new normalized data matrix. Use xnew with your gradient descent algorithm instead. Now once you find the parameters, to perform any predictions you must normalize any new test instances with the mean and standard deviation from the training set. Because the parameters learned are with respect to the statistics of the training set, you must also apply the same transformations to any test data you want to submit to the prediction model.

Assuming you have new data points stored in a matrix called xx, you would normalize them first and then perform the predictions:

xxnew = bsxfun(@rdivide, bsxfun(@minus, xx, mX), sX);

Now that you have this, you can perform your predictions:

pred = sigmoid(xxnew*theta) >= 0.5;

You can change the threshold of 0.5 to whatever value you believe best determines whether examples belong to the positive or negative class.

The learning rate is too large

As you mentioned in the comments, once you normalize the data the costs appear to be finite but then suddenly go to NaN after a few iterations. Normalization can only get you so far. If your learning rate or alpha is too large, each iteration will overshoot in the direction of the minimum, which makes the cost at each iteration oscillate or even diverge, and that is what appears to be happening. In your case, the cost is diverging or increasing at each iteration to the point where it is so large that it can't be represented using floating-point precision.

As such, one other option is to decrease your learning rate alpha until you see that the cost function is decreasing at each iteration. A popular method for determining the best learning rate is to perform gradient descent on a range of logarithmically spaced values of alpha, see what the final cost function value is for each, and choose the learning rate that results in the smallest cost.
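
One way to sketch such a sweep, reusing batchGD and the normalized data xnew from above (the logspace range below is an arbitrary illustration):

% Train with several logarithmically spaced learning rates and keep the
% one whose final cost is smallest.
alphas = logspace(-5, 0, 6);                 % 1e-5, 1e-4, ..., 1
finalCosts = zeros(numel(alphas), 1);
for i = 1:numel(alphas)
    [~, history] = batchGD(xnew, y, zeros(size(xnew,2), 1), alphas(i));
    finalCosts(i) = history(end);
end
[~, best] = min(finalCosts);                 % min ignores NaN entries from diverging runs
bestAlpha = alphas(best);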


Using the two facts above together should allow gradient descent to converge quite nicely, assuming that the cost function is convex. In this case for logistic regression, it most certainly is.
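
Putting both fixes together: train on the standardized matrix xnew with the learning rate chosen by the sweep above, and confirm the recorded costs decrease (a minimal sketch):

theta0 = zeros(size(xnew, 2), 1);
[theta, cost_history] = batchGD(xnew, y, theta0, bestAlpha);
plot(cost_history);    % with normalized data and a small enough alpha, this should decrease at every iteration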
