Using glmnet to predict a continuous variable in a dataset


Problem description


I have this data set.

I wanted to use the R package glmnet to determine which predictors would be useful in predicting fertility. However, I have been unable to do so, most likely due to not having a full understanding of the package. The fertility variable is SP.DYN.TFRT.IN. I want to see which predictors in the data set give the most predictive power for fertility. I wanted to use LASSO or ridge regression to shrink the number of coefficients, and I know this package can do that. I'm just having some trouble implementing it.

I know there are no code snippets which I apologize for but I am rather lost on how I would code this out.

Any advice is appreciated.

Thank you for reading

Solution

Here is an example of how to run glmnet:

library(glmnet)
library(tidyverse)

df is the data set you provided.

select y variable:

y <- df$SP.DYN.TFRT.IN

select numerical variables:

df %>%
  select(-SP.DYN.TFRT.IN, -region, -country.code) %>%
  as.matrix() -> x

select factor variables and convert to dummy variables:

df %>%
  select(region, country.code) %>%
  model.matrix( ~ .-1, .) -> x_train

run the model(s); several parameters can be tweaked here, so I suggest checking the documentation. Here I just run 5-fold cross-validation to determine the best lambda:

cv_fit <- cv.glmnet(x, y, nfolds = 5) #just with numeric variables

cv_fit_2 <- cv.glmnet(cbind(x, x_train), y, nfolds = 5) #both factor and numeric variables

par(mfrow = c(2,1))
plot(cv_fit)
plot(cv_fit_2)
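One of the tweakable parameters mentioned above is the mixing parameter alpha, which controls the penalty type: alpha = 1 (the default) gives the LASSO and alpha = 0 gives ridge regression, with values in between giving an elastic net. A minimal sketch on simulated data (the variable names here are illustrative, not from the question's dataset):

```r
library(glmnet)

set.seed(42)
n <- 100; p <- 10
x_sim <- matrix(rnorm(n * p), n, p)              # simulated predictor matrix
y_sim <- 2 * x_sim[, 1] - x_sim[, 2] + rnorm(n)  # continuous response

# alpha = 1 (default) -> LASSO; alpha = 0 -> ridge; in between -> elastic net
cv_lasso <- cv.glmnet(x_sim, y_sim, nfolds = 5, alpha = 1)
cv_ridge <- cv.glmnet(x_sim, y_sim, nfolds = 5, alpha = 0)
```

The practical difference: LASSO can set coefficients exactly to zero (so it does feature selection), while ridge only shrinks them toward zero.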

best lambda:

cv_fit$lambda[which.min(cv_fit$cvm)]

coefficients at the best lambda:

coef(cv_fit, s = cv_fit$lambda[which.min(cv_fit$cvm)])

equivalent to:

coef(cv_fit, s = "lambda.min")

after running coef(cv_fit, s = "lambda.min"), all features shown as . in the resulting sparse matrix are dropped from the model. This situation corresponds to the lambda depicted by the left vertical dashed line on the plots.
I suggest reading the linked documentation - elastic nets are quite easy to grasp if you know a bit of linear regression, and the package is quite intuitive. I also suggest reading ISLR, at least the part on L1/L2 regularization, and these videos: 1, 2, 3, 4, 5, 6 - the first three are about estimating model performance via test error and the last three are about the question at hand. This one shows how to implement these models in R. By the way, the guys in the videos invented the LASSO and made glmnet.
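If you just want the names of the predictors that survive the shrinkage, you can filter the coefficient matrix at lambda.min. A small self-contained sketch on simulated data (variable names are illustrative):

```r
library(glmnet)

set.seed(1)
x_sim <- matrix(rnorm(100 * 10), 100, 10)  # simulated predictors
y_sim <- 2 * x_sim[, 1] + rnorm(100)       # continuous response
cv_fit_sim <- cv.glmnet(x_sim, y_sim, nfolds = 5)

# coef() returns a sparse matrix; dropped features print as "."
cf <- as.matrix(coef(cv_fit_sim, s = "lambda.min"))
kept <- rownames(cf)[cf[, 1] != 0]  # intercept plus retained predictors
kept
```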

Also check the glmnetUtils library, which provides a formula interface and other nice things like built-in mixing parameter (alpha) selection. Here is the vignette.
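For completeness, a sketch of the glmnetUtils formula interface; its cva.glmnet function cross-validates over several alpha values as well as lambda. This assumes the glmnetUtils package is installed, and the data frame here is simulated for illustration:

```r
library(glmnetUtils)

set.seed(7)
df_sim <- data.frame(  # illustrative stand-in for the real data
  y  = rnorm(80),
  x1 = rnorm(80),
  x2 = rnorm(80),
  g  = factor(sample(c("a", "b"), 80, replace = TRUE))  # factors handled automatically
)

# formula interface: no manual model.matrix() step needed
cv_u <- cv.glmnet(y ~ ., data = df_sim, nfolds = 5)

# cross-validate over the mixing parameter alpha as well as lambda
cva_fit <- cva.glmnet(y ~ ., data = df_sim, nfolds = 5)
```

The formula interface builds the dummy variables for factor columns internally, replacing the separate model.matrix step shown above.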
