Best way to create response curves for a GLM species distribution model in R?
Problem description
I'm running a binomial GLM to predict the probability of a species occurrence, where I am training on one dataset and testing the model on another dataset:
TrainingData<-read.csv("TrainingData.csv")[,-1]
TrainingData[,1]<-as.factor(TrainingData[,1])
TrainingData[,4]<-as.factor(TrainingData[,4])
TestData<-read.csv("TestData.csv")[,-1]
TestData[,1]<-as.factor(TestData[,1])
TestData[,4]<-as.factor(TestData[,4])
mod<-glm(presence~var1+var2+var3, family=binomial, data=TrainingData)
probs=predict(mod, TestData, type="response")
What is the best way (or function) to create response curves to plot the relationship between the probability of presence and each predictor variable?
Thanks!
Recommended answer
The marginal probabilities can be calculated from predict.glm with type = "terms",
since each term is returned centered on the link scale, which corresponds to holding the remaining variables at their mean values.
Each term is converted back to the probability scale with plogis(term + intercept), where the intercept is the "constant" attribute of the returned matrix.
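The relationship behind plogis(term + intercept) can be checked on a small example. The sketch below uses simulated data (the variable names mirror the question, but the values are made up) and confirms that the centered terms plus the "constant" attribute sum to the usual linear predictor:

```r
# Simulated data: names mirror the question, values are invented for illustration.
set.seed(1)
n <- 200
d <- data.frame(
  var1 = rnorm(n),
  var2 = runif(n),
  var3 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
d$presence <- rbinom(n, 1, plogis(-0.5 + 1.2 * d$var1 - 0.8 * d$var2))

mod <- glm(presence ~ var1 + var2 + var3, family = binomial, data = d)

# Centered per-term contributions on the link (logit) scale
terms_link <- predict(mod, type = "terms")
const <- attr(terms_link, "constant")

# Summing all terms and adding the constant recovers the linear predictor
link_from_terms <- rowSums(terms_link) + const
link_direct <- predict(mod, type = "link")
all.equal(unname(link_from_terms), unname(link_direct))

# Per-variable probability: one term plus the constant, back-transformed
prob_by_term <- plogis(terms_link + const)
head(prob_by_term)
```

Each column of prob_by_term is the predicted probability when only that term varies and the other terms sit at their average contribution.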
Second, because your data set contains a combination of continuous and factor predictor variables, separate plots were made for each type and combined with grid.arrange.
Although this answers your question directly based on the glm model you presented, I would still recommend examining the spatial autocorrelation of both your predictor and response variables, as it is likely to affect your final model.
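One possible way to examine spatial autocorrelation is to compute Moran's I on the model residuals. The sketch below is only an illustration under strong assumptions: the coordinate columns x and y are hypothetical (the question does not say the data contain coordinates), the data are simulated, and the inverse-distance weights are one choice among many. Packages such as spdep provide tested implementations (for example moran.test) that would normally be preferable.

```r
# Hypothetical example: coordinates x and y are NOT part of the original data.
set.seed(42)
n <- 150
sites <- data.frame(x = runif(n), y = runif(n), var1 = rnorm(n))
sites$presence <- rbinom(n, 1, plogis(0.8 * sites$var1))

fit <- glm(presence ~ var1, family = binomial, data = sites)
res <- residuals(fit, type = "pearson")

# Inverse-distance spatial weights, zero on the diagonal
dmat <- as.matrix(dist(sites[, c("x", "y")]))
w <- 1 / dmat
diag(w) <- 0

# Moran's I = (n / sum(w)) * sum_ij w_ij z_i z_j / sum_i z_i^2
moran <- function(z, w) {
  z <- z - mean(z)
  (length(z) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}
morans_i <- moran(res, w)

# Permutation test: shuffle residuals across sites to build a null distribution
perm <- replicate(999, moran(sample(res), w))
p_value <- (sum(abs(perm - mean(perm)) >= abs(morans_i - mean(perm))) + 1) / (999 + 1)
c(morans_i = morans_i, p_value = p_value)
```

A small p-value would suggest that nearby sites have similar residuals, in which case the standard errors of the GLM should be treated with caution.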
library(reshape2)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
TrainingData <- read.csv("~/Downloads/TrainingData.csv", header = TRUE)
TrainingData[['presence']] <- as.factor(TrainingData[['presence']])
TrainingData[['var3']] <- as.factor(TrainingData[['var3']])
TrainingData[['X']] <- NULL # Not used in the model
TestData <- read.csv("~/Downloads/TestData.csv", header = TRUE)
TestData[['presence']] <- as.factor(TestData[['presence']])
TestData[['var3']] <- as.factor(TestData[['var3']])
TestData[['X']] <- NULL
Presence/absence model
mod <- glm(presence ~ var1 + var2 + var3, family = binomial, data = TrainingData)
Get predicted probabilities for each of the centered terms (i.e., with the remaining variables set to their mean).
mod_terms <- predict(mod, newdata = TestData, type = "terms")
mod_prob <- data.frame(idx = 1:nrow(TestData), plogis(mod_terms +
attr(mod_terms, "constant")))
mod_probg <- mod_prob %>% gather(variable, probability, -idx)
Melt the test data into long format
TestData['idx'] <- 1:nrow(TestData) # Add index to data
TestData[['X']] <- NULL # Drop the X variable since it was not used in the model
data_long <- melt(TestData, id = c("presence","idx"))
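# melt() stores mixed column types as character, so convert back to numeric.
# The original referenced data_df here, which is not defined until the merge step.
# This assumes the var3 levels are numeric codes; other labels would become NA.
data_long[['value']] <- as.numeric(data_long[['value']])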
Merge the test data with the predictions, then separate the rows for the continuous variables (var1 and var2) from those for the factor variable (var3).
# Merge Testdata with predictions
data_df <- merge(data_long, mod_probg, by = c("idx", "variable"))
data_df <- data_df %>% arrange(variable, value)
data_continuous <- data_df %>% filter(., variable != "var3") %>%
transform(value = as.numeric(value)) %>% arrange(variable, value)
data_factor <- data_df %>% filter(., variable == "var3") %>%
transform(value = as.factor(value))%>%
arrange(idx)
ggplot output
g_continuous <- ggplot(data_continuous, aes(x = value, y = probability)) + geom_point()+
facet_wrap(~variable, scales = "free_x")
g_factor <- ggplot(data = data_factor, aes(x = value, y = probability)) + geom_boxplot() +
facet_wrap(~variable)
grid.arrange(g_continuous, g_factor, nrow = 1)
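An alternative way to draw response curves, not the method used in the answer above, is to predict over a regular grid of one predictor while holding the others fixed. The sketch below assumes simulated data standing in for TrainingData, and response_curve is a hypothetical helper written for this illustration. It fixes the continuous predictors at their means and the factor at its first level, and adds an approximate 95% band computed on the link scale.

```r
# Simulated stand-in for TrainingData: names mirror the question, values are made up.
set.seed(2)
n <- 300
train <- data.frame(
  var1 = rnorm(n, 10, 2),
  var2 = runif(n, 0, 100),
  var3 = factor(sample(c("1", "2", "3"), n, replace = TRUE))
)
train$presence <- rbinom(n, 1, plogis(-4 + 0.3 * train$var1 + 0.02 * train$var2))
mod <- glm(presence ~ var1 + var2 + var3, family = binomial, data = train)

# Hypothetical helper: vary one continuous predictor, hold the others fixed
response_curve <- function(model, data, focal, n_points = 100) {
  grid <- data.frame(
    var1 = mean(data$var1),
    var2 = mean(data$var2),
    var3 = factor(levels(data$var3)[1], levels = levels(data$var3))
  )[rep(1, n_points), ]
  grid[[focal]] <- seq(min(data[[focal]]), max(data[[focal]]),
                       length.out = n_points)
  # Predict on the link scale so the interval stays inside (0, 1) after plogis
  pr <- predict(model, newdata = grid, type = "link", se.fit = TRUE)
  data.frame(
    value = grid[[focal]],
    probability = plogis(pr$fit),
    lower = plogis(pr$fit - 1.96 * pr$se.fit),
    upper = plogis(pr$fit + 1.96 * pr$se.fit)
  )
}

curve_var1 <- response_curve(mod, train, "var1")
plot(curve_var1$value, curve_var1$probability, type = "l", ylim = c(0, 1),
     xlab = "var1", ylab = "Probability of presence")
lines(curve_var1$value, curve_var1$lower, lty = 2)
lines(curve_var1$value, curve_var1$upper, lty = 2)
```

Because the grid is evenly spaced, this gives a smooth line rather than a scatter of points at the observed values; calling response_curve(mod, train, "var2") would produce the curve for the second predictor in the same way.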