Using glmnet to predict a continuous variable in a dataset

Question

I have this data set: wbh

I want to use the R package glmnet to determine which predictors are useful for predicting fertility. So far I have been unable to do so, most likely because I do not fully understand the package. The fertility variable is SP.DYN.TFRT.IN. I want to see which predictors in the data set give the most predictive power for fertility. I would like to use LASSO or ridge regression to shrink the coefficients, and I know this package can do that. I'm just having some trouble implementing it.

I know there are no code snippets, which I apologize for, but I am rather lost on how I would code this.

Any advice is appreciated.

Thanks for reading.

Recommended answer

Here is an example of how to run glmnet:

library(glmnet)
library(tidyverse)

df is the data set you provided.

Select the y variable:

y <- df$SP.DYN.TFRT.IN

Select the numeric variables:

df %>%
  select(-SP.DYN.TFRT.IN, -region, -country.code) %>%
  as.matrix() -> x
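One thing to watch for here: glmnet does not accept missing values in x or y. If the data contains NAs (an assumption on my part, I have not checked the linked file), one simple option is to keep only the complete rows before building y and x. Here is a minimal sketch on a made-up data frame; toy and its columns are hypothetical stand-ins for df.

```r
# Hypothetical toy data with some missing values (stand-in for df)
toy <- data.frame(
  y = c(1.2, 2.5, NA, 3.1),
  a = c(10, NA, 30, 40),
  b = c(0.1, 0.2, 0.3, 0.4)
)

# Keep only rows with no NA in any column
keep <- complete.cases(toy)
toy_complete <- toy[keep, ]

# Build the response vector and predictor matrix from the complete rows
y_toy <- toy_complete$y
x_toy <- as.matrix(toy_complete[, c("a", "b")])

print(dim(x_toy))
```

Dropping rows is the bluntest fix. With many NAs spread over many columns you may lose most of the data, in which case dropping sparse columns first or imputing may be a better choice.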

Select the factor variables and convert them to dummy variables:

df %>%
  select(region, country.code) %>%
  model.matrix( ~ .-1, .) -> x_train
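To see what the model.matrix call does, here is a small sketch on a made-up factor column (toy and its values are hypothetical, not taken from your data). Removing the intercept with - 1 gives the factor one 0/1 column per level.

```r
# Hypothetical toy data frame with a single factor column
toy <- data.frame(region = factor(c("Asia", "Europe", "Asia", "Africa")))

# "~ . - 1" removes the intercept, so each level gets its own indicator column
dummies <- model.matrix(~ . - 1, data = toy)

print(colnames(dummies))
print(dummies)
```

Each row has exactly one 1, marking the level that row belongs to. Note that when several factors are passed, only the first one keeps all of its levels; the others drop one level as a reference.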

Run the model(s). Several parameters here can be tweaked, so I suggest checking the documentation. Here I just run 5-fold cross-validation to determine the best lambda:

cv_fit <- cv.glmnet(x, y, nfolds = 5) #just with numeric variables

cv_fit_2 <- cv.glmnet(cbind(x ,x_train), y, nfolds = 5) #both factor and numeric variables

par(mfrow = c(2,1))
plot(cv_fit)
plot(cv_fit_2)
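The calls above use the default alpha = 1, which is the LASSO. Since you asked about both LASSO and ridge, here is a rough sketch of the difference on simulated data (x_sim, y_sim and the true coefficients are made up for illustration). Setting alpha = 0 gives ridge regression, which shrinks coefficients but does not set them to zero.

```r
library(glmnet)

set.seed(1)
n <- 100
p <- 10

# Simulated predictors; only v1 and v2 actually drive the response
x_sim <- matrix(rnorm(n * p), n, p)
colnames(x_sim) <- paste0("v", 1:p)
y_sim <- 2 * x_sim[, 1] - 1.5 * x_sim[, 2] + rnorm(n)

# alpha = 1 is the LASSO (the default), alpha = 0 is ridge
lasso_cv <- cv.glmnet(x_sim, y_sim, alpha = 1, nfolds = 5)
ridge_cv <- cv.glmnet(x_sim, y_sim, alpha = 0, nfolds = 5)

# Count non-zero coefficients, excluding the intercept in row 1
n_lasso <- sum(as.matrix(coef(lasso_cv, s = "lambda.min"))[-1, 1] != 0)
n_ridge <- sum(as.matrix(coef(ridge_cv, s = "lambda.min"))[-1, 1] != 0)

print(c(lasso = n_lasso, ridge = n_ridge))
```

If the goal is to find out which predictors matter, LASSO (or an elastic net with alpha between 0 and 1) is usually the more useful choice, because ridge keeps every predictor in the model.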

Best lambda:

cv_fit$lambda[which.min(cv_fit$cvm)]

Coefficients at the best lambda:

coef(cv_fit, s = cv_fit$lambda[which.min(cv_fit$cvm)])

This is equivalent to:

coef(cv_fit, s = "lambda.min")
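If you want the names of the retained predictors rather than reading them off the printed table, something along these lines should work. This is a sketch on simulated data (x_sim, y_sim and the variable names are made up). It uses "lambda.1se", the other value cv.glmnet stores, which picks a more heavily penalized model whose cross-validated error is within one standard error of the minimum.

```r
library(glmnet)

set.seed(42)

# Simulated data: only v1 and v4 have a real effect on the response
x_sim <- matrix(rnorm(200 * 8), 200, 8,
                dimnames = list(NULL, paste0("v", 1:8)))
y_sim <- 3 * x_sim[, "v1"] - 2 * x_sim[, "v4"] + rnorm(200)

fit <- cv.glmnet(x_sim, y_sim, nfolds = 5)

# Convert the sparse coefficient matrix to a dense one-column matrix
b <- as.matrix(coef(fit, s = "lambda.1se"))

# Names of predictors with a non-zero coefficient, without the intercept
selected <- setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
print(selected)
```

The same code works with s = "lambda.min", which typically keeps more predictors.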

After running coef(cv_fit, s = "lambda.min"), all features shown with a . in the resulting table have a coefficient of exactly zero and are dropped from the model. This corresponds to the lambda marked by the left vertical dashed line on the plots.
I suggest reading the linked documentation. Elastic nets are quite easy to grasp if you know a bit of linear regression, and the package is quite intuitive. I also suggest reading ISLR, at least the part on L1/L2 regularization, and watching these videos: 1, 2, 3, 4, 5, 6. The first three are about estimating model performance via test error and the last three are about the question at hand. This one shows how to implement these models in R. By the way, the presenters in these videos invented LASSO and wrote glmnet.

Also check the glmnetUtils library, which provides a formula interface and other nice things such as built-in selection of the mixing parameter (alpha). Here is the vignette.
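For a rough idea of what the formula interface looks like, here is a sketch using the built-in mtcars data instead of your data set (the choice of mpg as the response is only for illustration). It assumes glmnetUtils is installed; check the vignette for the exact arguments.

```r
library(glmnetUtils)

set.seed(1)

# Formula interface: no need to build the x matrix or dummy variables by hand
cv_formula <- glmnetUtils::cv.glmnet(mpg ~ ., data = mtcars, nfolds = 5)
print(coef(cv_formula, s = "lambda.min"))

# cva.glmnet cross-validates over a grid of alpha values as well as lambda
cva_fit <- cva.glmnet(mpg ~ ., data = mtcars, nfolds = 5)
print(cva_fit$alpha)
```

With your data the call would look something like cv.glmnet(SP.DYN.TFRT.IN ~ ., data = df), with factor columns converted to dummy variables automatically.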
