Assessing/Improving prediction with linear discriminant analysis or logistic regression

Problem description

I recently needed to combine two or more variables on some data set to evaluate if their combination could enhance predictivity, thus I made some logistic regression in R. Now, on the statistic Q&A, someone suggested that I may use the linear discriminant analysis.

Since I don't have fitcdiscr.m in MATLAB, I'd rather go with lda in R, but I cannot use the fit results to predict AUC or anything else I could use. Indeed, I see that the fit output of lda in R is some sort of object with multiple components, and I guess I should use fit$posterior to predict Cases against Controls, but I cannot get those data out of it.

For further information, here are the results I get as fit$posterior:

$posterior
            0          1
1   0.7707927 0.22920726
2   0.7085165 0.29148352
3   0.6990989 0.30090106
4   0.5902161 0.40978387
5   0.8667109 0.13328912
6   0.6924406 0.30755939
7   0.7471086 0.25289141
8   0.7519326 0.24806736

And so on, up to the last observation, which is 242. Every time I try to take, for example, column 1 with fit$posterior[,1], I get:

        1         2         3         4         5         6         7         8 
0.7707927 0.7085165 0.6990989 0.5902161 0.8667109 0.6924406 0.7471086 0.7519326 
        9        10        11        12        13        14        15        16 
0.7519326 0.6902850 0.7519326 0.8080445 0.8075360 0.8484318 0.4860899 0.8694121
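
Incidentally, that column is essentially all an AUC computation needs: with CV=TRUE, lda returns leave-one-out posterior probabilities, and the column named "1" is the estimated P(Case = 1) for each subject. A minimal sketch (assuming the pROC package, and that na.omit dropped no rows, so the posteriors line up with data):

library(pROC)
post.case <- fit$posterior[, "1"]  # P(Case = 1) from the cross-validated LDA fit
auc(response = data$Case, predictor = post.case)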

I don't know which part of the code could be useful, since I only did some very basic computations:

library(gdata)
data <- read.xls("ECGvarious.xls", perl="C:/Strawberry/perl/bin/perl.exe")
i <- 6
p <- 19
temp  <- data[, i]   # first candidate variable
temp1 <- data[, p]   # second candidate variable
library(MASS)
fit <- lda(Case ~ temp + temp1, data=data, na.action="na.omit", CV=TRUE)

I can't link the data; anyway, ECGvarious is simply an N observations x P variables matrix, with N = N1 + N2, where N1 is the number of Controls and N2 the number of Cases, the Cases being defined as subjects who developed the pathology during follow-up. The very last column of data is just 0 for Controls and 1 for Cases.

When I performed the logistic regression, I did:

library(pROC)   # provides auc()
mod1 <- glm(Case ~ temp + temp1, data=data, family="binomial")
auctemp <- auc(Case ~ predict(mod1), data=data)

Answer

Here's my input concerning logistic regression and prediction (I don't know much about linear discrimination but understand it's closely related to logistic regression, which I know much better). I'm not sure I'm following all of your reasoning, nor if this will be a satisfactory answer, but hopefully it won't hurt. This has been a review of some epidemiology classes for me. I hope it's not too formal and addresses at least in part some of your questions. If not, and if other users think this would better belong on Cross Validated, I won't take offense. :)

We'll first generate 200 observations, having increasing levels of probability for Case=1. The first predictor (pred1) will follow a distribution that is nonlinear, close to the one being modeled when doing logistic regression. It will be rather closely related to the proportion of Cases. The second predictor will just be random, uniformly distributed noise.

set.seed(2351)
df <- data.frame(Case = c(sample(c(0,1), size = 67, prob = c(0.8, 0.2), replace = TRUE), 
                          sample(c(0,1), size = 66, prob = c(0.5, 0.5), replace = TRUE), 
                          sample(c(0,1), size = 67, prob = c(0.2, 0.8), replace = TRUE)),
                 pred1 = 6/(1+4*exp(-seq(from = -3, to = 5, length.out = 200))) + rnorm(n = 200, mean = 2, sd=.5),
                 pred2 = runif(n = 200, min = 0, max = 100))

We see in the boxplot below that the observations where case==1 generally have higher pred1, which is intended (from the way we generated the data). At the same time, there is an overlap, otherwise it would make it too easy to decide on a cutoff point/threshold.

boxplot(pred1 ~ Case, data=df, xlab="Case", ylab="pred1")

First using both predictors:

model.1 <- glm(Case ~ pred1 + pred2, data=df, family=binomial(logit))
summary(model.1)

# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -2.058258   0.479094  -4.296 1.74e-05 ***
# pred1        0.428491   0.075373   5.685 1.31e-08 ***
# pred2        0.003399   0.005500   0.618    0.537    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 276.76  on 199  degrees of freedom
# Residual deviance: 238.51  on 197  degrees of freedom
# AIC: 244.51

As we'd expect, the first predictor is rather strongly related, and the second, poorly related to the outcome.

Note that to get Odds Ratios from those coefficients, we need to exponentiate them:

exp(model.1$coefficients[2:3])

#    pred1    pred2 
# 1.534939 1.003405   # Odds Ratios (making the relationships appear more clearly). 
                      # Use `exp(confint(model.1))` to get confidence intervals.

We'll compare this model to a simpler model, removing the second predictor:

model.2 <- glm(Case ~ pred1, data=df, family=binomial(logit))
summary(model.2)

# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.87794    0.37452  -5.014 5.32e-07 ***
# pred1        0.42651    0.07514   5.676 1.38e-08 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 276.76  on 199  degrees of freedom
# Residual deviance: 238.89  on 198  degrees of freedom
# AIC: 242.89

exp(model.2$coefficients)[2]

#    pred1 
# 1.531907  # Odds Ratio

We could also run an anova(model.1, model.2), but let's skip this part and move on to prediction, keeping the simpler model, as the second variable doesn't add much predictive value, if any. In practice, having more predictors is rarely a problem unless they are truly random noise, but here the focus is more on the operation of predicting and choosing a proper threshold.
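
For reference, that comparison would be a likelihood-ratio test on the two nested models; a minimal sketch (test="Chisq" being the usual choice for a binomial family):

anova(model.2, model.1, test = "Chisq")  # does adding pred2 significantly improve the fit?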

In the model.2 object (a list), there is an item named fitted.values. Those values are exactly the same as what we'd get from predict(model.2, type="response"), and can be interpreted as probabilities: one per row, based on the predictor(s) and their coefficient(s).
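
A quick way to verify that equivalence (this identity holds for any glm fit):

all.equal(unname(model.2$fitted.values),
          unname(predict(model.2, type = "response")))
# [1] TRUE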

It is also possible to predict the outcome for hypothetical rows not in our initial dataframe.

With model.1 (2 predictors):

predict(model.1, newdata = list(pred1=1, pred2=42), type="response")

#         1 
# 0.1843701 

With model.2 (1 predictor):

predict(model.2, newdata = list(pred1=12), type="response")

#       1 
# 0.96232 
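
For intuition, that number can be reproduced by hand from the model.2 coefficients shown earlier, passing the linear predictor through the inverse logit (plogis in R):

plogis(-1.87794 + 0.42651 * 12)  # inverse logit of intercept + slope * pred1
# ~0.96232, matching predict() above up to rounding of the printed coefficients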


Going from probability to binary response

Looking back at the link between our predictor pred1 and the calculated probability of having Case=1:

plot(df$pred1, model.2$fitted.values, 
     xlab="pred1", ylab="probability that Case=1")

We notice that, since we have only one predictor, the probability is a direct function of it. Had we kept the other predictor in the equation, we would see the points grouped around the same line, but in a cloud of points.

But this doesn't change the fact that if we are to evaluate how well our model can predict binary outcomes, we need to settle on a threshold above which we'll consider that the observation is a Case. Several packages have tools to help picking that threshold. But even without any additional package, we can calculate various properties over a range of thresholds using a function such as the following, which will calculate the sensitivity (ability to detect True Cases), specificity (ability to identify True Non Cases), and other properties well described here.

pp <- predict(model.2, type = "response")   # fitted probabilities, computed once

df.ana <- data.frame(thresh = seq(from = 0.01, to = 1, by = 0.01))
for(i in seq_along(df.ana$thresh)) {
    pred.pos <- pp >= df.ana$thresh[i]      # predicted "Case" at this threshold
    df.ana$sensitivity[i]    <- sum(df$Case == 1 & pred.pos)  / sum(df$Case == 1)
    df.ana$specificity[i]    <- sum(df$Case == 0 & !pred.pos) / sum(df$Case == 0)
    df.ana$pos.pred.value[i] <- sum(df$Case == 1 & pred.pos)  / sum(pred.pos)
    df.ana$neg.pred.value[i] <- sum(df$Case == 0 & !pred.pos) / sum(!pred.pos)
    df.ana$accuracy[i]       <- sum(pred.pos == df$Case)      / nrow(df)
}

which.max(df.ana$accuracy)

# [1] 46 

optimal.thresh <- df.ana$thresh[which.max(df.ana$accuracy)] # 0.46

The accuracy is the proportion of correct predictions over all predictions. The 46th threshold (0.46) is the "best" by that measure. Let's check a few neighboring rows in the generated data frame; they tell us that 0.47 would work just as well on all fronts. Fine-tuning would involve adding some new data to our initial data frame.

df.ana[45:48,]

#    thresh sensitivity specificity pos.pred.value neg.pred.value accuracy
# 45   0.45   0.7142857   0.6947368      0.7211538      0.6875000    0.705
# 46   0.46   0.7142857   0.7157895      0.7352941      0.6938776    0.715
# 47   0.47   0.7142857   0.7157895      0.7352941      0.6938776    0.715
# 48   0.48   0.7047619   0.7157895      0.7326733      0.6868687    0.710

Note that the auc function (area under the curve) will give the same number as the accuracy for that threshold (strictly speaking, the AUC of a dichotomized predictor equals (sensitivity + specificity)/2, which here happens to round to the same 0.715):

library(pROC)
auc(Case ~ as.numeric(predict(model.2, type="response") >= optimal.thresh), data=df)

# Area under the curve: 0.715
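
For comparison, the threshold-free AUC, computed from the continuous predicted probabilities rather than their dichotomized version (the more usual way of reporting an AUC), would be obtained with:

auc(Case ~ predict(model.2, type = "response"), data = df)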


A few plots

# thresholds against accuracy
plot(x=df.ana$thresh, y=df.ana$accuracy, type="l",
         xlab="Threshold", ylab="", xlim=c(0,1), ylim=c(0,1))
text(x = 0.1, y = 0.5, labels = "Accuracy", col="black")

# thresholds against Sensitivity 
lines(x=df.ana$thresh, y=df.ana$sensitivity, type="l", col="blue") # Sensitivity: we want to maximize this, but not too much
text(x = 0.1, y = 0.95, labels = "Sensitivity", col="blue")

# thresholds against specificity 
lines(x=df.ana$thresh, y=df.ana$specificity, type="l", col="red") # Specificity: we also want to maximize this, but not too much
text(x = 0.1, y = 0.05, labels = "Specificity", col="red")

# optimal threshold vertical line
abline(v=optimal.thresh)
text(x=optimal.thresh + .01, y=0.05, labels= optimal.thresh)

Incidentally, all lines converge more or less to the same point, which suggests this is a good compromise between all the qualities we look for in a predictive tool. But depending on your objectives, it might be better to pick a lower or a higher threshold. Statistical tools are useful, but in the end, some other considerations are often more important in making a final decision.
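
As one common alternative (not used in the original answer), Youden's J statistic picks the threshold maximizing sensitivity + specificity - 1, straight from the df.ana table built above:

j <- df.ana$sensitivity + df.ana$specificity - 1  # Youden's J at each threshold
df.ana$thresh[which.max(j)]                       # threshold with the highest J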

The following graph is the same as the one which would be produced with pROC's roc:

plot(x=df.ana$specificity, y = df.ana$sensitivity, type="l", col="blue",
         xlim = c(1,0), xlab = "Specificity", ylab = "Sensitivity") 

# Equivalent to
# plot(roc(predictor=model.2$fitted.values, response = model.2$y))

The following function allows one to calculate, for a logistic model fit, the same stats seen above, and gives a 2x2 table for any chosen threshold.

diagnos.test <- function(model, threshold) {
    pp <- predict(model, type = "response")  # fitted probabilities
    pred.pos <- pp >= threshold              # predicted "Case" at this threshold
    output <- list()
    output$stats <- c(
      sensitivity  = sum(model$y == 1 & pred.pos)  / sum(model$y == 1),
      specificity  = sum(model$y == 0 & !pred.pos) / sum(model$y == 0),
      pos.pr.value = sum(model$y == 1 & pred.pos)  / sum(pred.pos),
      neg.pr.value = sum(model$y == 0 & !pred.pos) / sum(!pred.pos),
      accuracy     = sum(pred.pos == model$y)      / length(model$y))
    output$tab <- addmargins(t(table(model$y, as.numeric(pred.pos),
                                     dnn = list("Cases", "Predictions")))[2:1, 2:1])
    return(output)
}

diagnos.test(model.2, 0.47)

# $stats
#  sensitivity  specificity pos.pr.value neg.pr.value     accuracy 
#    0.7142857    0.7157895    0.7352941    0.6938776    0.7150000 
# 
# $tab
#            Cases
# Predictions   1  0 Sum
#         1    75 27 102
#         0    30 68  98
#         Sum 105 95 200


Final note

I don't pretend I have covered everything on prediction, sensitivity and specificity; my goal was more to go as far as possible using common language and calculations, not relying on any specific packages.
