What does the option normalize=True in Lasso (sklearn) do?


Problem description

I have a matrix where each column has mean 0 and std 1

In [67]: x_val.std(axis=0).min()
Out[67]: 0.99999999999999922

In [71]: x_val.std(axis=0).max()
Out[71]: 1.0000000000000007

In [72]: x_val.mean(axis=0).max()
Out[72]: 1.1990408665951691e-16

In [73]: x_val.mean(axis=0).min()
Out[73]: -9.7144514654701197e-17

The number of non-zero coefficients changes if I use the normalize option

In [74]: l = Lasso(alpha=alpha_perc70).fit(x_val, y_val)

In [81]: sum(l.coef_!=0)
Out[81]: 47

In [84]: l2 = Lasso(alpha=alpha_perc70, normalize=True).fit(x_val, y_val)

In [93]: sum(l2.coef_!=0)
Out[93]: 3

It seems to me that normalize just sets the variance of each column to 1. It is strange that the results change so much, given that my data already has variance = 1.

So what does normalize=True actually do?

Solution

This is due to an (or a potential [1]) inconsistency in the concept of scaling in sklearn.linear_model.base.center_data: if normalize=True, it divides by the norm of each column of the design matrix, not by its standard deviation. For what it's worth, the keyword normalize=True will be deprecated from sklearn version 0.17.

Solution: Do not use normalize=True. Instead, build a sklearn.pipeline.Pipeline and prepend a sklearn.preprocessing.StandardScaler to your Lasso object. That way you don't even need to perform your initial scaling.
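A minimal sketch of such a pipeline, using random placeholder data rather than the asker's x_val/y_val:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(50, 10)   # raw, unscaled design matrix (placeholder data)
y = rng.randn(50)

# StandardScaler standardizes each column (mean 0, std 1) before Lasso sees it,
# so no manual pre-scaling of X is needed.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=0.1)),
])
model.fit(X, y)
print(model.named_steps["lasso"].coef_)
```

The scaler is fit only on the data passed to fit, so the same pipeline can be cross-validated without leaking scaling statistics into held-out folds.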

Note that the data loss term in the sklearn implementation of Lasso is scaled by n_samples. Thus the minimal penalty that yields an all-zero solution is alpha_max = np.abs(X.T.dot(y)).max() / n_samples (for normalize=False).
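As a quick numerical check of that formula, here is a sketch on synthetic standardized data (X and y below are placeholders, not the asker's variables):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_samples, n_features = 30, 8
X = rng.randn(n_samples, n_features)
X = (X - X.mean(0)) / X.std(0)   # standardize the columns
y = rng.randn(n_samples)
y -= y.mean()                    # center the target

# Smallest penalty for which the lasso solution is identically zero
alpha_max = np.abs(X.T.dot(y)).max() / n_samples

coef_at_max = Lasso(alpha=alpha_max).fit(X, y).coef_       # expected: all zeros
coef_below = Lasso(alpha=0.5 * alpha_max).fit(X, y).coef_  # expected: some non-zeros
print(np.count_nonzero(coef_below))
```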

[1] I say potential inconsistency because normalize is associated with the word norm, so it is at least linguistically consistent :)

[Stop reading here if you don't want the details]

Here is some copy-and-pasteable code reproducing the problem:

import numpy as np
rng = np.random.RandomState(42)

n_samples, n_features, n_active_vars = 20, 10, 5
X = rng.randn(n_samples, n_features)
X = ((X - X.mean(0)) / X.std(0))

beta = rng.randn(n_features)
beta[rng.permutation(n_features)[:n_active_vars]] = 0.

y = X.dot(beta)

print(X.std(0))
print(X.mean(0))

from sklearn.linear_model import Lasso

lasso1 = Lasso(alpha=.1)
print(lasso1.fit(X, y).coef_)

lasso2 = Lasso(alpha=.1, normalize=True)
print(lasso2.fit(X, y).coef_)

To understand what is going on, now observe that

lasso1.fit(X / np.sqrt(n_samples), y).coef_ / np.sqrt(n_samples)

is equal to

lasso2.fit(X, y).coef_

Hence, scaling the design matrix and appropriately rescaling the coefficients by np.sqrt(n_samples) converts one model into the other. The same effect can be obtained by acting on the penalty: a lasso estimator with normalize=True whose penalty is scaled down by np.sqrt(n_samples) behaves like a lasso estimator with normalize=False (on your type of data, i.e. already standardized to std=1).

lasso3 = Lasso(alpha=.1 / np.sqrt(n_samples), normalize=True)
print(lasso3.fit(X, y).coef_)  # yields the same coefficients as lasso1.fit(X, y).coef_

