Extract data from glmnet output


Question

I am trying to do feature selection using the glmnet package. I have been able to run glmnet, but I have a tough time understanding the output. My goal is to get the list of genes and their respective coefficients so that I can rank the genes by how relevant they are for separating my two groups of labels.

x = manual_normalized_melt[,colnames(manual_normalized_melt) %in% 
sig_0_01_ROTS$Gene]
y = cellID_reference$conditions

glmnet_l0 <- glmnet(x = as.matrix(x), y = y, family = "binomial",alpha = 1)

Any hints or instructions on how to go from here? I know that the data I want is inside glmnet_l0, but I am unsure how to extract it.

Additionally, does anyone know if there is a way to use the L0-norm for feature selection in R?

Thank you very much!

Recommended answer

Here are some approaches using glmnet:

First, some data, because you did not post any (the iris data, reduced to two levels of Species):

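data(iris)
x <- iris[, 1:4]
y <- iris[, 5]
# Merge setosa into virginica so that only two classes remain
y[y == "setosa"] <- "virginica"
# Re-create the factor to drop the now-empty "setosa" level
y <- factor(y)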

First, run a cross-validated model to find the best lambda:

library(glmnet)
model_cv <- cv.glmnet(x = as.matrix(x),
                      y = y,
                      family = "binomial",
                      alpha = 1,
                      nfolds = 5,
                      intercept = FALSE)

Here I chose 5-fold cross-validation to determine the best lambda.

To see the coefficients at the best lambda:

coef(model_cv, s = "lambda.min")
#output
#5 x 1 sparse Matrix of class "dgCMatrix"
#                      1
#(Intercept)   .        
#Sepal.Length -0.7966676
#Sepal.Width   1.9291364
#Petal.Length -0.9502821
#Petal.Width   2.7113327

Here you can see that no variables were dropped (a dropped variable would show . instead of a coefficient). If all the features are on the same scale (as with gene expression data), you might consider adding standardize = FALSE to the glmnet call, since it defaults to TRUE. At least I would when modeling expression data.
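Since the goal in the question is a ranked gene list, one possible way to get there is to turn the sparse coefficient matrix into a data frame, drop the zero entries, and sort by absolute value. This is only a sketch under assumptions: the names coef_mat and coef_df are illustrative, set.seed is added just to make the folds reproducible, and ranking by coefficient magnitude is only meaningful when the features are on comparable scales.

```r
library(glmnet)

# Same two-class iris setup as in the answer
data(iris)
x <- as.matrix(iris[, 1:4])
y <- iris[, 5]
y[y == "setosa"] <- "virginica"
y <- factor(y)

set.seed(1)  # assumption: fixed seed so the CV folds are reproducible
model_cv <- cv.glmnet(x = x, y = y, family = "binomial",
                      alpha = 1, nfolds = 5, intercept = FALSE)

# Convert the sparse dgCMatrix to a dense matrix, then to a data frame
coef_mat <- as.matrix(coef(model_cv, s = "lambda.min"))
coef_df <- data.frame(feature = rownames(coef_mat),
                      coefficient = coef_mat[, 1],
                      row.names = NULL)

# Keep selected features only (non-zero, excluding the intercept row)
coef_df <- coef_df[coef_df$feature != "(Intercept)" &
                   coef_df$coefficient != 0, ]

# Rank by absolute coefficient size, largest first
coef_df <- coef_df[order(abs(coef_df$coefficient), decreasing = TRUE), ]
print(coef_df)
```

With real expression data, the feature column would hold the gene names taken from the column names of x.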

To see the best lambda (the same value should also be stored in model_cv$lambda.min):

model_cv$lambda[which.min(model_cv$cvm)]

Now you can fit a model with all the data:

glmnet_l0 <- glmnet(x = as.matrix(x),
                    y = y,
                    family = "binomial",
                    alpha = 1,
                    intercept = FALSE)

You can plot the coefficient paths against lambda and add a vertical line marking the best lambda:

plot(glmnet_l0, xvar = "lambda")
abline(v = log(model_cv$lambda[which.min(model_cv$cvm)]))

Here one can see that the coefficients were hardly shrunk at all at the best lambda.

With higher-dimensional data you will see many coefficient traces go to 0 before the best lambda is reached, and many . entries in the coef matrix.
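To check how many features survive at each point on the path without reading the plot, one option is to inspect the df and lambda components of the fitted glmnet object. The sketch below assumes the same iris setup as above; the names path_summary, idx, and selected are illustrative, and the choice of the 10th lambda is arbitrary.

```r
library(glmnet)

data(iris)
x <- as.matrix(iris[, 1:4])
y <- iris[, 5]
y[y == "setosa"] <- "virginica"
y <- factor(y)

fit <- glmnet(x = x, y = y, family = "binomial", alpha = 1)

# fit$df is the number of non-zero coefficients at each value in fit$lambda
path_summary <- data.frame(lambda = fit$lambda, n_nonzero = fit$df)
print(head(path_summary))

# Features still in the model at one lambda on the path (arbitrary pick)
idx <- min(10, length(fit$lambda))
beta <- as.matrix(coef(fit, s = fit$lambda[idx]))
selected <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
print(selected)
```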

When using predict.glmnet, set s = model_cv$lambda[which.min(model_cv$cvm)]; otherwise it will generate predictions for all tested lambda values.
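The difference between predicting with and without s can be seen by comparing the shapes of the two results. This is a minimal sketch, assuming the same iris setup; best_lambda, pred_all, and pred_best are illustrative names, and predicting on the training data is done here only to keep the example short.

```r
library(glmnet)

data(iris)
x <- as.matrix(iris[, 1:4])
y <- iris[, 5]
y[y == "setosa"] <- "virginica"
y <- factor(y)

set.seed(1)  # assumption: fixed seed so the CV folds are reproducible
model_cv <- cv.glmnet(x = x, y = y, family = "binomial", alpha = 1, nfolds = 5)
fit <- glmnet(x = x, y = y, family = "binomial", alpha = 1)

best_lambda <- model_cv$lambda.min

# Without s: one column of predictions per lambda on the path
pred_all <- predict(fit, newx = x, type = "class")

# With s: a single column, for the chosen lambda only
pred_best <- predict(fit, newx = x, s = best_lambda, type = "class")

print(dim(pred_all))
print(dim(pred_best))
print(table(predicted = pred_best[, 1], actual = y))
```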

Also check this post; it contains some other relevant information.
