Simple binary logistic regression using MATLAB



I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).

I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.

My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:

[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');

However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.

I then tried using an additional parameter to specify the size of my binomial sample:

glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));

This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower quarter of the 'S'-shaped curve that is typical of logistic regression plots.
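For reference, the same fit-then-evaluate pipeline (a crude maximum-likelihood fit followed by a glmval-style probability curve) can be sketched outside MATLAB. This is an illustrative Python/NumPy stand-in with synthetic data, not the glmfit internals:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def fit_logistic(x, y, lr=0.5, iters=5000):
    """Crude maximum-likelihood fit of P(y=1) = sigmoid(b0 + b1*x) by gradient ascent."""
    b = np.zeros(2)
    X = np.column_stack([np.ones_like(x), x])   # prepend an intercept column
    for _ in range(iters):
        b += lr * X.T @ (y - sigmoid(X @ b)) / len(y)
    return b

# Synthetic noisy data: larger x makes y = 1 more likely, but the classes
# overlap, so the data are NOT linearly separable and the coefficients stay finite.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = (rng.uniform(size=200) < sigmoid(8.0 * (x - 0.5))).astype(float)

b = fit_logistic(x, y)
x_fit = np.linspace(0.0, 1.0, 101)        # analogue of X_fit = linspace(0,1)
y_fit = sigmoid(b[0] + b[1] * x_fit)      # analogue of Y_fit = glmval(b, ..., 'logit')
```

Each y_fit value is the modelled probability that an observation at that x is classified as 1 ("correct"), which is the predictor the question asks for.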

My questions are as follows:

1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to take the stats output from glmfit as input, but my use of glmfit is not giving correct results.

Any comments and input would be very useful, thanks!

UPDATE (3/18/14)

I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.

I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal).

The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work where glmfit did not in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.

Finally, how does one interpret the dev output from the mnrfit function? The MATLAB documentation states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is it only to be compared against dev values from other models?
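For what it's worth, for 0/1 outcomes the deviance reduces to −2 × the log-likelihood of the fitted probabilities (the saturated model's log-likelihood is zero), and it is mostly useful for comparing nested models. A small Python/NumPy sketch with made-up fitted probabilities:

```python
import numpy as np

def bernoulli_deviance(y, p):
    """Deviance for 0/1 outcomes: -2 * log-likelihood of the fitted probabilities p."""
    return -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
p_model = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])  # hypothetical fitted probabilities
p_null = np.full_like(y, y.mean())                  # intercept-only (null) model

dev_model = bernoulli_deviance(y, p_model)
dev_null = bernoulli_deviance(y, p_null)
# The drop dev_null - dev_model is what gets compared against a chi-squared
# distribution when testing whether the covariate improves the fit.
```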

Solution

It sounds like your data may be linearly separable. In short, because your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).

If your data were two-dimensional, this would mean that you could draw a line through your two-dimensional space X such that all instances of a particular class lie on one side of the line.
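In the one-dimensional case described above, separability is trivial to check: the two classes must occupy non-overlapping ranges of x. A quick illustrative check (Python/NumPy, purely for demonstration):

```python
import numpy as np

def is_linearly_separable_1d(x, y):
    """1-D classes are linearly separable iff their value ranges do not overlap."""
    x0, x1 = x[y == 0], x[y == 1]
    return bool(x0.max() < x1.min() or x1.max() < x0.min())

x = np.array([0.05, 0.10, 0.20, 0.30, 0.70, 0.80, 0.90, 0.95])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(is_linearly_separable_1d(x, y))   # True: any xDiv near 0.5 splits the classes
```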

This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.

Logistic regression is trying to fit a function of the following form:

y = 1 / (1 + exp(-(b0 + b1*x)))

This will only return values of y = 0 or y = 1 when the expression inside the exponential in the denominator, b0 + b1*x, is at negative infinity or infinity.
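Numerically, that saturation is easy to see (Python here purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The function passes through 0.5 where the linear predictor is 0 and
# saturates toward 0 or 1 as |z| grows, so exact 0/1 outputs would
# require infinitely large coefficients.
print(sigmoid(0.0))    # 0.5
print(sigmoid(50.0))   # ~1
print(sigmoid(-50.0))  # ~0
```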

Now, because your data is linearly separable, and MATLAB's LR function attempts to find a maximum-likelihood fit for the data, you will get extreme weight values.

This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.
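A rough numerical illustration of that effect, using a simple gradient-ascent fit in Python/NumPy (the fitting routine is a stand-in for intuition, not glmfit):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def fit_logistic(x, y, lr=1.0, iters=5000):
    """Gradient-ascent maximum-likelihood fit of P(y=1) = sigmoid(b0 + b1*x)."""
    b = np.zeros(2)
    X = np.column_stack([np.ones_like(x), x])
    for _ in range(iters):
        b += lr * X.T @ (y - sigmoid(X @ b)) / len(y)
    return b

x = np.linspace(0.0, 1.0, 20)
y_sep = (x > 0.5).astype(float)   # perfectly separable labels
y_flip = y_sep.copy()
y_flip[0] = 1.0                   # flip one label: no longer separable

b_sep = fit_logistic(x, y_sep)
b_flip = fit_logistic(x, y_flip)
# On the separable data the slope keeps growing the longer the fit runs,
# while the flipped data set has a finite maximum-likelihood solution.
```

After the same number of iterations, the slope fitted to the separable labels is far larger in magnitude than the one fitted after the single flip, which is the "dragged dramatically closer to zero" effect described above.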
