Assessing/Improving prediction with linear discriminant analysis or logistic regression


Problem description

I recently needed to combine two or more variables on a data set to evaluate whether their combination could enhance predictivity, so I ran some logistic regressions in R. Now, on the statistics Q&A site, someone suggested that I could use linear discriminant analysis.

Since I don't have fitcdiscr.m in MATLAB, I'd rather go with lda in R, but I cannot use the fit results to compute an AUC or anything comparable. Indeed, I see that the fitted lda object in R contains some sort of vector with multiple classes, and I guess I should use fit$posterior to predict Cases against Controls, but I cannot extract those data from it.

For further information, these are the results I get as fit$posterior:

$posterior
            0          1
1   0.7707927 0.22920726
2   0.7085165 0.29148352
3   0.6990989 0.30090106
4   0.5902161 0.40978387
5   0.8667109 0.13328912
6   0.6924406 0.30755939
7   0.7471086 0.25289141
8   0.7519326 0.24806736

And so on up to the last observation, which is number 242. Every time I try to take, for example, column 1 via fit$posterior[,1], I get:

        1         2         3         4         5         6         7         8 
0.7707927 0.7085165 0.6990989 0.5902161 0.8667109 0.6924406 0.7471086 0.7519326 
        9        10        11        12        13        14        15        16 
0.7519326 0.6902850 0.7519326 0.8080445 0.8075360 0.8484318 0.4860899 0.8694121
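
For what it's worth, here is a minimal sketch (assuming the pROC package, and assuming na.omit dropped no rows so the posteriors line up with data$Case) of how such posterior probabilities could be turned into an AUC:

library(pROC)

# With CV=TRUE, lda() returns leave-one-out cross-validated posterior probabilities;
# the column named "1" holds the estimated P(Case = 1) for each observation.
prob.case <- fit$posterior[, "1"]

roc.lda <- roc(response = data$Case, predictor = prob.case)
auc(roc.lda)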

I don't know which part of the code could be useful, since I only did some very basic computations:

library(gdata)
data <- read.xls("ECGvarious.xls", perl="C:/Strawberry/perl/bin/perl.exe")
i <- 6                  # column index of the first variable
p <- 19                 # column index of the second variable
temp  <- data[, i]
temp1 <- data[, p]
library(MASS)
fit <- lda(Case ~ temp + temp1, data=data, na.action="na.omit", CV=TRUE)

I can't link the data; anyway, ECGvarious is simply an N observations x P variables matrix, with N = N1 + N2, where N1 is the number of Controls, N2 the number of Cases, and the Cases are defined as subjects who developed the pathology after a follow-up. The very last column of data is just 0 or 1, for Controls and Cases respectively.


When I performed the logistic regression, I did:

library(pROC)   # auc() presumably comes from pROC, which is also used in the answer below
mod1 <- glm(Case ~ temp + temp1, data=data, family="binomial")
auctemp <- auc(Case ~ predict(mod1), data=data)
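
(A side note: predict(mod1) returns the linear predictor, i.e. the log-odds. Since the AUC depends only on how the predictions rank the observations, using predict(mod1, type="response") would give the same AUC.)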

Recommended answer

Here's my input concerning logistic regression and prediction (I don't know much about linear discriminant analysis, but I understand it's closely related to logistic regression, which I know much better). I'm not sure I'm following all of your reasoning, nor whether this will be a satisfactory answer, but hopefully it won't hurt. This has been a review of some epidemiology classes for me. I hope it's not too formal and that it addresses at least some of your questions. If not, and if other users think this would belong better on Cross Validated, I won't take offense. :)


We'll first generate 200 observations, having increasing levels of probability for Case=1. The first predictor (pred1) will follow a distribution that is nonlinear, close to the one being modeled when doing logistic regression. It will be rather closely related to the proportion of Cases. The second predictor will just be random, uniformly distributed noise.

set.seed(2351)
df <- data.frame(Case = c(sample(c(0,1), size = 67, prob = c(0.8, 0.2), replace = TRUE), 
                          sample(c(0,1), size = 66, prob = c(0.5, 0.5), replace = TRUE), 
                          sample(c(0,1), size = 67, prob = c(0.2, 0.8), replace = TRUE)),
                 pred1 = 6/(1+4*exp(-seq(from = -3, to = 5, length.out = 200))) + rnorm(n = 200, mean = 2, sd=.5),
                 pred2 = runif(n = 200, min = 0, max = 100))

We see in the boxplot below that observations where Case==1 generally have higher pred1, which is intended (given the way we generated the data). At the same time, there is some overlap; otherwise it would be too easy to decide on a cutoff point/threshold.

boxplot(pred1 ~ Case, data=df, xlab="Case", ylab="pred1")


First using both predictors:

model.1 <- glm(Case ~ pred1 + pred2, data=df, family=binomial(logit))
summary(model.1)

# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -2.058258   0.479094  -4.296 1.74e-05 ***
# pred1        0.428491   0.075373   5.685 1.31e-08 ***
# pred2        0.003399   0.005500   0.618    0.537    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 276.76  on 199  degrees of freedom
# Residual deviance: 238.51  on 197  degrees of freedom
# AIC: 244.51


As we'd expect, the first predictor is rather strongly related, and the second, poorly related to the outcome.


Note that to get Odds Ratios from those coefficients, we need to exponentiate them:

exp(model.1$coefficients[2:3])

#    pred1    pred2 
# 1.534939 1.003405   # Odds Ratios (making the relationships appear more clearly). 
                      # Use `exp(confint(model.1))` to get confidence intervals.
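
(For instance, exp(0.428491) ≈ 1.53: each one-unit increase in pred1 multiplies the odds of Case=1 by about 1.53, holding pred2 constant.)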


We'll compare this model to a simpler model, removing the second predictor:

model.2 <- glm(Case ~ pred1, data=df, family=binomial(logit))
summary(model.2)

# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.87794    0.37452  -5.014 5.32e-07 ***
# pred1        0.42651    0.07514   5.676 1.38e-08 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 276.76  on 199  degrees of freedom
# Residual deviance: 238.89  on 198  degrees of freedom
# AIC: 242.89

exp(model.2$coefficients)[2]

#    pred1 
# 1.531907  # Odds Ratio

We could also run anova(model.1, model.2), but let's skip that part and move on to prediction, keeping the simpler model since the second variable doesn't add much predictive value, if any. In practice, having more predictors is rarely a problem unless they are truly random noise, but here I focus more on the mechanics of predicting and of choosing a proper threshold.
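
For completeness, a sketch of that skipped comparison (a likelihood-ratio test between the two nested models; not run in the original answer):

anova(model.2, model.1, test = "Chisq")
# A large p-value would confirm that adding pred2 does not significantly improve the fit.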

In the model.2 object (a list), there is an item named fitted.values. Those values are exactly the same as what we'd get from predict(model.2, type="response"), and they can be interpreted as probabilities: one per row, based on the predictor(s) and their coefficient(s).
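
A quick way to convince oneself of that equivalence (just a sanity check, not part of the original answer):

all.equal(unname(model.2$fitted.values), unname(predict(model.2, type = "response")))
# should return TRUE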


It is also possible to predict the outcome for hypothetical rows not in our initial dataframe.


With model.1 (2 predictors):

predict(model.1, newdata = list(pred1=1, pred2=42), type="response")

#         1 
# 0.1843701 


With model.2 (1 predictor):

predict(model.2, newdata = list(pred1=12), type="response")

#       1 
# 0.96232 
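
To see where that last number comes from, a quick check by hand using the model.2 coefficients: the linear predictor at pred1 = 12 is -1.87794 + 0.42651 * 12 ≈ 3.240, and 1 / (1 + exp(-3.240)) ≈ 0.962, matching the predicted probability above.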


Going from probability to binary response

Looking back at the link between our predictor pred1 and the calculated probability of having Case=1:

plot(df$pred1, model.2$fitted.values, 
     xlab="pred1", ylab="probability that Case=1")

We can note that, since we have only one predictor, the probability is a direct function of it. If we had kept the other predictor in the equation, we would see the points grouped around the same line, but in a cloud of points.

But this doesn't change the fact that if we are to evaluate how well our model can predict binary outcomes, we need to settle on a threshold above which we'll consider that the observation is a Case. Several packages have tools to help pick that threshold. But even without any additional package, we can calculate various properties over a range of thresholds using a function such as the following, which will calculate the sensitivity (ability to detect True Cases), the specificity (ability to identify True Non-Cases), and other properties well described here.

# Thresholds from 0.01 to 1 in steps of 0.01, so that row i corresponds to threshold i/100
df.ana <- data.frame(thresh = seq(from = 1, to = 100, by = 1) / 100)
prob <- predict(model.2, type = "response")   # fitted probabilities, computed once
for(i in seq_along(df.ana$thresh)) {
    df.ana$sensitivity[i]    <- sum(df$Case == 1 & prob >= df.ana$thresh[i]) / sum(df$Case == 1)
    df.ana$specificity[i]    <- sum(df$Case == 0 & prob <  df.ana$thresh[i]) / sum(df$Case == 0)
    df.ana$pos.pred.value[i] <- sum(df$Case == 1 & prob >= df.ana$thresh[i]) / sum(prob >= df.ana$thresh[i])
    df.ana$neg.pred.value[i] <- sum(df$Case == 0 & prob <  df.ana$thresh[i]) / sum(prob <  df.ana$thresh[i])
    df.ana$accuracy[i]       <- sum((prob >= df.ana$thresh[i]) == df$Case) / nrow(df)
}

which.max(df.ana$accuracy)

# [1] 46 

optimal.thresh <- df.ana$thresh[which.max(df.ana$accuracy)] # 0.46

The accuracy is the proportion of correct predictions out of all predictions. The 46th threshold (0.46) is the "best" in that respect. Let's check a few neighboring rows in the generated dataframe; they tell us that 0.47 would work just as well on all fronts. Fine-tuning would involve adding some new data to our initial dataframe.

df.ana[45:48,]

#    thresh sensitivity specificity pos.pred.value neg.pred.value accuracy
# 45   0.45   0.7142857   0.6947368      0.7211538      0.6875000    0.705
# 46   0.46   0.7142857   0.7157895      0.7352941      0.6938776    0.715
# 47   0.47   0.7142857   0.7157895      0.7352941      0.6938776    0.715
# 48   0.48   0.7047619   0.7157895      0.7326733      0.6868687    0.710

Note that the auc function (area under the curve) gives essentially the same number as the accuracy for that threshold: for a dichotomized predictor, the AUC equals (sensitivity + specificity) / 2, which here coincides with the accuracy because the two classes are roughly balanced:

library(pROC)
auc(Case ~ as.numeric(predict(model.2, type="response") >= optimal.thresh), data=df)

# Area under the curve: 0.715
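
As an aside, pROC can also suggest a threshold directly. A sketch (it optimizes the Youden index, i.e. sensitivity + specificity, so it may land on a slightly different value than the accuracy-based 0.46 above):

roc.2 <- roc(response = df$Case, predictor = model.2$fitted.values)
coords(roc.2, "best", ret = c("threshold", "sensitivity", "specificity", "accuracy"))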


Some plots

# thresholds against accuracy
plot(x=df.ana$thresh, y=df.ana$accuracy, type="l",
         xlab="Threshold", ylab="", xlim=c(0,1), ylim=c(0,1))
text(x = 0.1, y = 0.5, labels = "Accuracy", col="black")

# thresholds against Sensitivity 
lines(x=df.ana$thresh, y=df.ana$sensitivity, type="l", col="blue") # Sensitivity: we want to maximize this, but not at any cost
text(x = 0.1, y = 0.95, labels = "Sensitivity", col="blue")

# thresholds against specificity 
lines(x=df.ana$thresh, y=df.ana$specificity, type="l", col="red") # Specificity: we also want to maximize this, but not at any cost
text(x = 0.1, y = 0.05, labels = "Specificity", col="red")

# optimal threshold vertical line
abline(v=optimal.thresh)
text(x=optimal.thresh + .01, y=0.05, labels= optimal.thresh)

Incidentally, all the lines converge more or less to the same point, which suggests this is a good compromise between all the qualities we look for in a predictive tool. But depending on your objectives, it might be better to pick a lower or a higher threshold. Statistical tools are useful, but in the end, other considerations are often more important in making a final decision.


The following graph is the same as the one which would be produced with pROC's roc:

plot(x=df.ana$specificity, y = df.ana$sensitivity, type="l", col="blue",
         xlim = c(1,0), xlab = "Specificity", ylab = "Sensitivity") 

# Equivalent to
# plot(roc(predictor=model.2$fitted.values, response = model.2$y))


The following function allows one to calculate, for a logistic model fit, the same stats seen above, and gives a 2x2 table for any chosen threshold.

diagnos.test <- function(model, threshold) {
    # Works for any glm fit with a 0/1 response; uses only the model object,
    # rather than objects from the global environment.
    pred <- predict(model, type = "response")   # fitted probabilities
    obs  <- model$y                             # observed 0/1 outcome
    output <- list()
    output$stats <- c(
      sensitivity  = sum(obs == 1 & pred >= threshold) / sum(obs == 1),
      specificity  = sum(obs == 0 & pred <  threshold) / sum(obs == 0),
      pos.pr.value = sum(obs == 1 & pred >= threshold) / sum(pred >= threshold),
      neg.pr.value = sum(obs == 0 & pred <  threshold) / sum(pred <  threshold),
      accuracy     = sum((pred >= threshold) == obs) / length(obs))
    output$tab <- addmargins(t(table(obs, as.numeric(pred >= threshold),
                                     dnn = list("Cases", "Predictions")))[2:1, 2:1])
    return(output)
}

diagnos.test(model.2, 0.47)

# $stats
#  sensitivity  specificity pos.pr.value neg.pr.value     accuracy 
#    0.7142857    0.7157895    0.7352941    0.6938776    0.7150000 
# 
# $tab
#            Cases
# Predictions   1  0 Sum
#         1    75 27 102
#         0    30 68  98
#         Sum 105 95 200
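
As a quick cross-check, the $stats values can be read off the table: sensitivity = 75/105 ≈ 0.714, specificity = 68/95 ≈ 0.716, and accuracy = (75 + 68)/200 = 0.715.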


Final note

I don't pretend to have covered everything about prediction, sensitivity and specificity; my goal was rather to go as far as possible using common language and calculations, not relying on any specific packages.
