How does glmnet's standardize argument handle dummy variables?


Question

In my dataset I have a number of continuous and dummy variables. For analysis with glmnet, I want the continuous variables to be standardized but not the dummy variables.

I currently do this manually by first defining a dummy vector of columns that have only values of [0,1] and then using the scale command on all the non-dummy columns. Problem is, this isn't very elegant.
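The manual workaround described above can be sketched as follows. This is a Python/NumPy mirror of the same logic (the original uses R's scale()); the function name and the 0/1 dummy-detection rule are illustrative, not part of glmnet:

```python
import numpy as np

def scale_non_dummy(X):
    """Standardize (mean 0, sd 1) every column of X except 0/1 dummy columns.

    A column is treated as a dummy if all of its values are 0 or 1.
    ddof=1 matches the sample standard deviation used by R's scale().
    """
    X = X.astype(float).copy()
    is_dummy = np.all(np.isin(X, (0.0, 1.0)), axis=0)
    cont = ~is_dummy
    X[:, cont] = (X[:, cont] - X[:, cont].mean(axis=0)) / X[:, cont].std(axis=0, ddof=1)
    return X

X = np.array([[1.0, 0],
              [2.0, 1],
              [3.0, 0],
              [4.0, 1]])
Xs = scale_non_dummy(X)
# continuous column 0 is standardized; dummy column 1 is left untouched
```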

But glmnet has a built-in standardize argument. By default, will this standardize the dummies too? If so, is there an elegant way to tell glmnet's standardize argument to skip the dummies?

Solution

In short, yes - this will standardize the dummy variables, but there's a reason for doing so. The glmnet function takes a matrix as the input for its X parameter, not a data frame, so it cannot distinguish the factor columns you would have had if the input were a data.frame. If you take a look at the R function, glmnet codes the standardize parameter internally as

    isd = as.integer(standardize)

This converts the R boolean to a 0 or 1 integer, which is fed to any of the internal FORTRAN routines (elnet, lognet, et al.)

If you go even further by examining the FORTRAN code (fixed width - old school!), you'll see the following block:

          subroutine standard1 (no,ni,x,y,w,isd,intr,ju,xm,xs,ym,ys,xv,jerr)    989
          real x(no,ni),y(no),w(no),xm(ni),xs(ni),xv(ni)                        989
          integer ju(ni)                                                        990
          real, dimension (:), allocatable :: v                                     
          allocate(v(1:no),stat=jerr)                                           993
          if(jerr.ne.0) return                                                  994
          w=w/sum(w)                                                            994
          v=sqrt(w)                                                             995
          if(intr .ne. 0)goto 10651                                             995
          ym=0.0                                                                995
          y=v*y                                                                 996
          ys=sqrt(dot_product(y,y)-dot_product(v,y)**2)                         996
          y=y/ys                                                                997
    10660 do 10661 j=1,ni                                                       997
          if(ju(j).eq.0)goto 10661                                              997
          xm(j)=0.0                                                             997
          x(:,j)=v*x(:,j)                                                       998
          xv(j)=dot_product(x(:,j),x(:,j))                                      999
          if(isd .eq. 0)goto 10681                                              999
          xbq=dot_product(v,x(:,j))**2                                          999
          vc=xv(j)-xbq                                                         1000
          xs(j)=sqrt(vc)                                                       1000
          x(:,j)=x(:,j)/xs(j)                                                  1000
          xv(j)=1.0+xbq/vc                                                     1001
          goto 10691                                                           1002

Take a look at the lines marked 1000 - this is basically applying the standardization formula to the X matrix.
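The arithmetic in that block can be mirrored in a short sketch. This is not glmnet's code, just the same weighted-moment formula from the no-intercept branch shown above: each column is divided by xs(j) = sqrt(sum(w*x^2) - (sum(w*x))^2), the weighted standard deviation with no Bessel correction (the vc/xs(j) lines). The function name is illustrative:

```python
import numpy as np

def glmnet_style_scale(X, w=None):
    """Column scaling as in glmnet's FORTRAN standard1 routine (a sketch).

    With observation weights w normalized to sum to 1, column j is divided by
    xs_j = sqrt(sum_i w_i x_ij**2 - (sum_i w_i x_ij)**2).
    Dummy (0/1) columns are scaled exactly like continuous ones.
    """
    X = X.astype(float)
    n = X.shape[0]
    w = np.full(n, 1.0 / n) if w is None else w / w.sum()
    xm = w @ X                       # weighted column means
    xs = np.sqrt(w @ X**2 - xm**2)   # weighted column sds (the "vc" lines)
    return X / xs, xs

X = np.array([[1.0, 0],
              [2.0, 1],
              [3.0, 0],
              [4.0, 1]])
Xstd, xs = glmnet_style_scale(X)
# a 0/1 dummy with weighted mean p gets xs = sqrt(p - p**2); here p = 0.5,
# so the dummy column is divided by 0.5 - it is standardized like any other
```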

Now statistically speaking, one does not generally standardize categorical variables to retain the interpretability of the estimated regressors. However, as pointed out by Tibshirani here, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" - so while this causes arbitrary scaling between continuous and categorical variables, it's done for equal penalization treatment.
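A toy calculation makes the fairness point concrete (illustrative only, no glmnet involved): rescaling a feature rescales the coefficient needed for the identical fit, and hence the L1 penalty charged for the same underlying effect:

```python
# If y = 2*x exactly, then after rescaling the feature to x/c the coefficient
# that reproduces the identical fit becomes 2*c, so lambda*|beta| charges a
# small-scale feature c times more for the same effect - unless columns are
# first put on a common scale.
penalties = []
for c in (1.0, 100.0):
    beta = 2.0 * c               # coefficient reproducing y = beta * (x / c)
    penalties.append(abs(beta))  # L1 penalty contribution per unit lambda

print(penalties)  # → [2.0, 200.0]
```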
