How does glmnet's standardize argument handle dummy variables?


Question

My dataset contains a number of continuous variables and dummy variables. For analysis with glmnet, I want the continuous variables to be standardized but not the dummy variables.

I currently do this manually: I first build a vector flagging the columns that contain only the values 0 and 1, and then apply the scale command to all the non-dummy columns. The problem is that this isn't very elegant.
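For concreteness, the manual approach described above might look like the following base-R sketch. The data and the column names (`age`, `income`, `smoker`) are made up for illustration, and the rule "a dummy is any column whose values are all 0 or 1" is an assumption about how the columns are flagged.

```r
# Toy design matrix: two continuous columns and one 0/1 dummy (made-up data).
set.seed(1)
x <- cbind(
  age    = rnorm(10, mean = 40, sd = 10),
  income = rnorm(10, mean = 50000, sd = 8000),
  smoker = rep(c(0, 1), times = 5)
)

# Flag a column as a dummy if every value in it is 0 or 1.
is_dummy <- apply(x, 2, function(col) all(col %in% c(0, 1)))

# Center and scale only the non-dummy columns; dummies are left untouched.
x_scaled <- x
x_scaled[, !is_dummy] <- scale(x[, !is_dummy, drop = FALSE])
```

After this, `x_scaled` has mean 0 and standard deviation 1 in the continuous columns, while `smoker` still holds its original 0/1 values.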

But glmnet has a built-in standardize argument. By default, will it standardize the dummies too? If so, is there an elegant way to tell glmnet's standardize argument to skip the dummies?

Recommended answer

In short, yes: this will standardize the dummy variables, but there's a reason for doing so. The glmnet function takes a matrix as the input for its x argument, not a data frame, so it cannot distinguish the factor columns you might have if the input were a data.frame. If you take a look at the R function, glmnet codes the standardize argument internally as

    isd = as.integer(standardize)

This converts the R logical to a 0 or 1 integer, which is then passed to whichever internal FORTRAN routine is used (elnet, lognet, and so on).

If you go even further and examine the FORTRAN code (fixed width, old school!), you'll see the following block:

          subroutine standard1 (no,ni,x,y,w,isd,intr,ju,xm,xs,ym,ys,xv,jerr)    989
          real x(no,ni),y(no),w(no),xm(ni),xs(ni),xv(ni)                        989
          integer ju(ni)                                                        990
          real, dimension (:), allocatable :: v                                     
          allocate(v(1:no),stat=jerr)                                           993
          if(jerr.ne.0) return                                                  994
          w=w/sum(w)                                                            994
          v=sqrt(w)                                                             995
          if(intr .ne. 0)goto 10651                                             995
          ym=0.0                                                                995
          y=v*y                                                                 996
          ys=sqrt(dot_product(y,y)-dot_product(v,y)**2)                         996
          y=y/ys                                                                997
    10660 do 10661 j=1,ni                                                       997
          if(ju(j).eq.0)goto 10661                                              997
          xm(j)=0.0                                                             997
          x(:,j)=v*x(:,j)                                                       998
          xv(j)=dot_product(x(:,j),x(:,j))                                      999
          if(isd .eq. 0)goto 10681                                              999
          xbq=dot_product(v,x(:,j))**2                                          999
          vc=xv(j)-xbq                                                         1000
          xs(j)=sqrt(vc)                                                       1000
          x(:,j)=x(:,j)/xs(j)                                                  1000
          xv(j)=1.0+xbq/vc                                                     1001
          goto 10691                                                           1002

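Take a look at the lines marked 1000: when isd is nonzero, they compute each column's weighted variance vc, take xs(j) = sqrt(vc), and divide the column by it. This is the standardization formula, and it is applied to every column of the x matrix for which ju(j) is nonzero, with no special case for dummy columns.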
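To see what those lines do to a dummy column, here is a base-R mirror of the quoted steps for a single column. It is only a sketch: the function name `standardize_column` is hypothetical, uniform observation weights are assumed, and it follows the excerpt as quoted, which sets xm(j) = 0.0 and does not subtract a mean.

```r
# Base-R mirror of the quoted FORTRAN steps for one column (illustrative only).
standardize_column <- function(x_col, w = rep(1, length(x_col))) {
  w   <- w / sum(w)        # w = w/sum(w)
  v   <- sqrt(w)           # v = sqrt(w)
  xw  <- v * x_col         # x(:,j) = v*x(:,j)
  xv  <- sum(xw * xw)      # xv(j) = dot_product(x(:,j), x(:,j))
  xbq <- sum(v * xw)^2     # xbq = dot_product(v, x(:,j))**2
  vc  <- xv - xbq          # weighted variance of the column
  xs  <- sqrt(vc)          # xs(j) = sqrt(vc)
  list(x = xw / xs, xs = xs)
}

dummy <- c(0, 1, 1, 0, 1, 0, 0, 0)   # share of ones: p = 3/8
res   <- standardize_column(dummy)
p     <- mean(dummy)

# For a 0/1 column the scale factor reduces to sqrt(p * (1 - p)).
all.equal(res$xs, sqrt(p * (1 - p)))   # TRUE
```

So a dummy is divided by sqrt(p(1 - p)), where p is the (weighted) share of ones, exactly as a continuous column would be divided by its standard deviation.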

Now, statistically speaking, one does not generally standardize categorical variables, so that the estimated coefficients stay interpretable. However, as Tibshirani points out, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables". So while this introduces an arbitrary scaling between continuous and categorical variables, it is done so that all regressors are penalized equally.
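If you still prefer to leave the dummies unscaled, one option is to scale the continuous columns yourself and switch off the internal step with standardize = FALSE. The sketch below uses simulated data and made-up column names (`x1`, `x2`, `d1`), and it only calls glmnet if the package is installed. Keep in mind that the dummies are then penalized on their raw 0/1 scale, which is precisely the unequal treatment the quote above warns about.

```r
# Simulated data: two continuous predictors and one 0/1 dummy (illustrative).
set.seed(42)
n <- 100
x <- cbind(
  x1 = rnorm(n, mean = 10, sd = 5),
  x2 = rnorm(n, mean = -3, sd = 2),
  d1 = rbinom(n, size = 1, prob = 0.3)
)
y <- 0.5 * x[, "x1"] - x[, "x2"] + 2 * x[, "d1"] + rnorm(n)

# Scale only the continuous columns by hand; the dummy keeps its 0/1 coding.
x_manual <- x
x_manual[, c("x1", "x2")] <- scale(x[, c("x1", "x2")])

if (requireNamespace("glmnet", quietly = TRUE)) {
  # Default behaviour: every column, dummies included, is standardized.
  fit_default <- glmnet::glmnet(x, y, standardize = TRUE)
  # Pre-scaled input: turn off glmnet's internal standardization.
  fit_manual <- glmnet::glmnet(x_manual, y, standardize = FALSE)
}
```

With standardize = TRUE, glmnet reports coefficients on the original scale of each column, so the default fit can still be read in the usual units.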
