How does glmnet's standardize argument handle dummy variables?
Question
In my dataset I have a number of continuous and dummy variables. For analysis with glmnet, I want the continuous variables to be standardized but not the dummy variables.
I currently do this manually by first building a vector that flags the columns containing only the values 0 and 1, and then using the scale command on all the non-dummy columns. Problem is, this isn't very elegant.
But glmnet has a built-in standardize argument. By default, will this standardize the dummies too? If so, is there an elegant way to tell glmnet's standardize argument to skip dummies?
Recommended answer
In short, yes: this will standardize the dummy variables, but there's a reason for doing so. The glmnet function takes a matrix as the input for its x argument, not a data frame, so it cannot make the distinction for factor columns that you might have if the argument were a data.frame. If you take a look at the R function, glmnet codes the standardize parameter internally as
isd = as.integer(standardize)
which converts the R boolean to a 0 or 1 integer to feed to any of the internal FORTRAN functions (elnet, lognet, et al.).
If you go even further by examining the FORTRAN code (fixed width - old school!), you'll see the following block:
subroutine standard1 (no,ni,x,y,w,isd,intr,ju,xm,xs,ym,ys,xv,jerr) 989
real x(no,ni),y(no),w(no),xm(ni),xs(ni),xv(ni) 989
integer ju(ni) 990
real, dimension (:), allocatable :: v
allocate(v(1:no),stat=jerr) 993
if(jerr.ne.0) return 994
w=w/sum(w) 994
v=sqrt(w) 995
if(intr .ne. 0)goto 10651 995
ym=0.0 995
y=v*y 996
ys=sqrt(dot_product(y,y)-dot_product(v,y)**2) 996
y=y/ys 997
10660 do 10661 j=1,ni 997
if(ju(j).eq.0)goto 10661 997
xm(j)=0.0 997
x(:,j)=v*x(:,j) 998
xv(j)=dot_product(x(:,j),x(:,j)) 999
if(isd .eq. 0)goto 10681 999
xbq=dot_product(v,x(:,j))**2 999
vc=xv(j)-xbq 1000
xs(j)=sqrt(vc) 1000
x(:,j)=x(:,j)/xs(j) 1000
xv(j)=1.0+xbq/vc 1001
goto 10691 1002
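Take a look at the lines marked 1000: they compute the scale xs(j) of column j as the square root of its weighted variance vc and then divide the column by it. This is basically the standardization formula applied column by column to the x matrix, with no special case for dummy columns.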
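To make the arithmetic in the excerpt concrete, here is a minimal R sketch that mirrors it for a single column. The function name standardize_column and the toy vectors are made up for illustration; the sketch follows only the branch visible in the excerpt (the one that sets xm(j) to 0 and rescales without centering), so it should be read as an approximation of what the excerpt does, not as glmnet's full behaviour.

```r
# Hypothetical helper: reproduces the excerpt's per-column arithmetic in R.
standardize_column <- function(x, w = rep(1, length(x))) {
  w   <- w / sum(w)        # w = w/sum(w)
  v   <- sqrt(w)           # v = sqrt(w)
  xw  <- v * x             # x(:,j) = v*x(:,j)
  xv  <- sum(xw * xw)      # dot_product(x(:,j), x(:,j))
  xbq <- sum(v * xw)^2     # dot_product(v, x(:,j))**2
  vc  <- xv - xbq          # weighted variance (1/n form, not n-1)
  xs  <- sqrt(vc)          # xs(j) = sqrt(vc)
  list(scaled = xw / xs, xs = xs, xbq = xbq, vc = vc)
}

dummy <- c(0, 1, 1, 0, 1)        # toy 0/1 column
age   <- c(23, 35, 41, 29, 52)   # toy continuous column

standardize_column(dummy)$xs     # scale applied to the dummy column
standardize_column(age)$xs       # scale applied to the continuous column
```

Nothing in the computation looks at what kind of column it is given: a 0/1 column with a proportion p of ones is simply divided by sqrt(p * (1 - p)), exactly as a continuous column is divided by its own standard deviation.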
Now statistically speaking, one does not generally standardize categorical variables, in order to retain the interpretability of the estimated coefficients. However, as pointed out by Tibshirani here, "The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" - so while this causes arbitrary scaling between continuous and categorical variables, it's done so that all regressors are penalized equally.
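If you still want the behaviour asked for in the question, one option is to keep the manual approach and switch glmnet's own standardization off. The sketch below is only an illustration under assumptions: the toy data and the column names age, income and female are invented, and a column is treated as a dummy purely because it contains only 0s and 1s. Keep in mind the point quoted above: with the dummies left on their raw 0/1 scale, the penalty no longer treats all regressors equally.

```r
# Hypothetical toy data: two continuous columns and one 0/1 dummy.
set.seed(1)
n <- 100
x <- cbind(age    = rnorm(n, 40, 10),
           income = rnorm(n, 50000, 15000),
           female = rbinom(n, 1, 0.5))
y <- rnorm(n)

# Flag a column as a dummy if it contains only the values 0 and 1.
is_dummy <- apply(x, 2, function(col) all(col %in% c(0, 1)))

# Scale only the non-dummy columns. Note that scale() divides by the n-1
# standard deviation, while the FORTRAN excerpt uses the 1/n form, so the
# two differ by a factor of sqrt((n - 1) / n).
x_scaled <- x
x_scaled[, !is_dummy] <- scale(x[, !is_dummy])

# Fit with the built-in standardization turned off (only if glmnet is installed).
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit <- glmnet::glmnet(x_scaled, y, standardize = FALSE)
}
```

This is the same procedure the question describes, just written out; it does not make the standardize argument itself skip dummies, since that argument is a single flag applied to the whole matrix.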