Simple binary logistic regression using MATLAB
Problem description
I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower quarter of the 'S'-shaped curve that is typical of logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal).
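The loop described above can probably be avoided, since mnrval accepts one predictor value per row. The following is only a sketch: the data are made up for illustration (they are not the asker's data), and the names dlow, dhi, pCorrect, lowerBound and upperBound are hypothetical.

```matlab
% Requires the Statistics and Machine Learning Toolbox (mnrfit, mnrval).
% Hypothetical, non-separable data: one "correct" label sits among the "incorrect" ones.
X = [0.05; 0.15; 0.25; 0.35; 0.45; 0.55; 0.65; 0.75; 0.85; 0.95];
Y = [0; 1; 0; 0; 0; 1; 1; 1; 1; 1];

% Y + 1 maps the labels {0, 1} to the categories {1, 2} that mnrfit expects.
[b_fit, dev, stats] = mnrfit(X, Y + 1);

% Evaluate the whole grid in one call: one row per input value.
loopVal = linspace(0, 1)';
[pihat, dlow, dhi] = mnrval(b_fit, loopVal, stats);

% Column k of pihat is the estimated probability of category k.
pCorrect   = pihat(:, 2);              % estimated P(Y == 1)
lowerBound = pCorrect - dlow(:, 2);    % lower confidence bound (95% by default)
upperBound = pCorrect + dhi(:, 2);     % upper confidence bound
```

As far as I understand the documentation, mnrfit uses the last category as the reference, so with Y + 1 it models the log-odds of Y = 0 against Y = 1. That would explain coefficients whose signs are opposite to those from glmfit fitted to Y directly.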
The stats parameter has a great correlation coefficient (0.9973), but the p-values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p << 0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB documentation states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is it only meaningful when compared to dev values from other models?
It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
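A quick way to check this for one-dimensional data is to compare the ranges of the two classes. This is a minimal sketch on made-up data; the variable names isSeparable and xDiv follow the description above but are otherwise my own.

```matlab
% Hypothetical 1-D data: every x below 0.5 is class 0, every x above is class 1.
X = [0.05; 0.20; 0.35; 0.45; 0.55; 0.70; 0.85; 0.95];
Y = [0; 0; 0; 0; 1; 1; 1; 1];

% In one dimension the classes are separable by a threshold exactly when
% their ranges do not overlap, in either order.
zerosBelowOnes = max(X(Y == 0)) < min(X(Y == 1));
onesBelowZeros = max(X(Y == 1)) < min(X(Y == 0));
isSeparable = zerosBelowOnes || onesBelowZeros;

if zerosBelowOnes
    % Any value strictly between the two classes works as a dividing point.
    xDiv = (max(X(Y == 0)) + min(X(Y == 1))) / 2;
    fprintf('Separable, e.g. at xDiv = %.3f\n', xDiv);
end
```

If isSeparable comes out true for your X and Y, the extreme coefficients and standard errors from glmfit are what I would expect.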
If your data were two-dimensional, this would mean you could draw a line through your two-dimensional space X such that all instances of one class are on one side of the line and all instances of the other class are on the other side.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:
$$y = \frac{1}{1 + e^{-(w_0 + w_1 x)}}$$
where w_0 and w_1 are the weights being learned. This will only return values of y = 0 or y = 1 when the expression within the exponential in the denominator is at negative infinity or infinity.
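To make the saturation concrete, here is a small sketch that evaluates the usual logistic form 1/(1 + exp(-z)), where z stands for the weighted input (my notation, assumed for illustration).

```matlab
% Logistic function; z is the linear predictor (assumed notation).
sigmoid = @(z) 1 ./ (1 + exp(-z));

z = [-50 -5 0 5 50];
y = sigmoid(z);
disp(y);

% Mathematically the output is strictly between 0 and 1 for any finite z.
% It only approaches 0 or 1 as z heads to minus or plus infinity
% (in double precision sigmoid(50) already rounds to 1).
```

So a fit that has to reproduce labels of exactly 0 and 1 on every point is pushed toward infinitely large values of z.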
Now, because your data is linearly separable, and MATLAB's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values: the likelihood keeps improving as the weights grow, so there is no finite maximum to converge to.
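The effect can be seen with a hand-rolled fit. The sketch below uses plain gradient ascent on the log-likelihood as a simplified stand-in (glmfit uses a different fitting routine internally), on made-up separable data with an untuned step size.

```matlab
% Hypothetical separable data with the class boundary near x = 0.5.
X = [0.05; 0.20; 0.35; 0.45; 0.55; 0.70; 0.85; 0.95];
Y = [0; 0; 0; 0; 1; 1; 1; 1];
A = [ones(size(X)) X];               % design matrix with an intercept column
sigmoid = @(z) 1 ./ (1 + exp(-z));

w = [0; 0];                          % [intercept; slope]
stepSize = 0.3;                      % illustrative value, not tuned
nIter = 10000;
slopeHistory = zeros(nIter, 1);
for iter = 1:nIter
    grad = A' * (Y - sigmoid(A * w)); % gradient of the log-likelihood
    w = w + stepSize * grad;          % ascent step
    slopeHistory(iter) = w(2);
end

% The slope never settles: it is still growing after many iterations.
checkpoints = [100 1000 10000];
slopeAt = slopeHistory(checkpoints);
disp([checkpoints(:) slopeAt(:)]);
```

The slope keeps climbing as the iteration count grows, which is the same behaviour that shows up as huge coefficients and standard errors in the glmfit output.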
This isn't necessarily a solution, but try flipping the label on just one of your data points (so for some index t where y(t) == 0, set y(t) = 1). As long as the flipped point is not the one sitting right at the edge next to the other class, this will cause your data to no longer be linearly separable, and the learned weight values will be dragged dramatically closer to zero.
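Here is a sketch of that experiment with glmfit, which also shows one way to get confidence bounds once the fit is well behaved. The data are invented, and Yflip, Xgrid, pHat, dLo, dHi and coefCI are names I made up for the illustration.

```matlab
% Requires the Statistics and Machine Learning Toolbox (glmfit, glmval).
X = [0.05; 0.15; 0.25; 0.35; 0.45; 0.55; 0.65; 0.75; 0.85; 0.95];
Y = [0; 0; 0; 0; 0; 1; 1; 1; 1; 1];   % perfectly separable labels

% Flip one class-0 label that lies well inside the class-0 range.
Yflip = Y;
Yflip(2) = 1;

[b, dev, stats] = glmfit(X, Yflip, 'binomial', 'link', 'logit');

% Predicted probability of class 1 on a grid, with 95% confidence bounds.
Xgrid = linspace(0, 1)';
[pHat, dLo, dHi] = glmval(b, Xgrid, 'logit', stats);
pLower = pHat - dLo;
pUpper = pHat + dHi;

% Approximate Wald intervals for the coefficients from their standard errors.
coefCI = [b - 1.96 * stats.se, b + 1.96 * stats.se];
disp([b stats.se coefCI]);
```

Note that flipping the class-0 point closest to the boundary (x = 0.45 here) would only move the dividing point and leave the data separable, which is why the sketch flips a point further inside the range.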