How does sklearn do linear regression when p > n?


Question

It's known that when the number of variables (p) is larger than the number of samples (n), the least squares estimator is not defined.

In sklearn I get these values:

In [30]: lm = LinearRegression().fit(xx,y_train)

In [31]: lm.coef_
Out[31]: 
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
         0.08619906, -0.08108713]])

In [32]: xx.shape
Out[32]: (1097, 3419)

Call [30] should return an error. How does sklearn work when p > n, as in this case?

It seems that the matrix is padded with some values:

if n > m:
    # need to extend b matrix as it will be filled with
    # a larger solution matrix
    if len(b1.shape) == 2:
        b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
        b2[:m, :] = b1
    else:
        b2 = np.zeros(n, dtype=gelss.dtype)
        b2[:m] = b1
    b1 = b2

Answer

When the linear system is underdetermined, sklearn.linear_model.LinearRegression finds the minimum-L2-norm solution, i.e.

argmin_w ||w||_2  subject to  Xw = y

This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.

w = np.linalg.pinv(X).dot(y)
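
To see that minimum-norm property concretely, here is a small sketch (not from the original answer; the variables X, y and the null-space construction are illustrative). Any null-space vector of X can be added to the pseudoinverse solution without changing Xw, but it can only increase the norm:

import numpy as np
from scipy.linalg import null_space

rng = np.random.RandomState(0)
X = rng.randn(5, 10)  # underdetermined: p = 10 > n = 5
y = rng.randn(5)

w_min = np.linalg.pinv(X).dot(y)                # minimum-norm solution
N = null_space(X)                               # orthonormal basis of the null space of X
w_other = w_min + N.dot(rng.randn(N.shape[1]))  # a different exact solution

print(np.allclose(X.dot(w_min), y))    # True: fits the data exactly
print(np.allclose(X.dot(w_other), y))  # True: so does the perturbed solution
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))  # True: pinv has the smallest norm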

The specific implementation of scipy.linalg.lstsq, which LinearRegression uses, calls get_lapack_funcs(('gelss',), ...); gelss is precisely a solver that finds the minimum-norm solution via singular value decomposition (provided by LAPACK).
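
As a quick cross-check (a sketch, not part of the original answer), you can request the gelss driver from scipy.linalg.lstsq explicitly and compare it against the pseudoinverse solution:

import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
X = rng.randn(5, 10)
y = rng.randn(5)

# lapack_driver='gelss' selects the SVD-based LAPACK routine the answer refers to
w_gelss, _, rank, sv = lstsq(X, y, lapack_driver='gelss')
w_pinv = np.linalg.pinv(X).dot(y)

print(np.allclose(w_gelss, w_pinv))  # True: both give the minimum-norm solution
print(rank)                          # 5: X has full row rank here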

Take a look at this example:

import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)

from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_
coef2 = np.linalg.pinv(X).dot(y)

print(coef1)
print(coef2)

And you will see that coef1 == coef2. (Note that fit_intercept=False is specified in the constructor of the sklearn estimator; otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients.)
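
To illustrate what that centering does, here is a minimal sketch; it assumes that with fit_intercept=True (for dense input) sklearn centers X and y before solving, so the same pinv computation on the centered data should reproduce sklearn's coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)

lr = LinearRegression(fit_intercept=True).fit(X, y)

# Manually center the data, solve, then recover the intercept
Xc = X - X.mean(axis=0)
yc = y - y.mean()
coef_manual = np.linalg.pinv(Xc).dot(yc)
intercept_manual = y.mean() - X.mean(axis=0).dot(coef_manual)

print(np.allclose(lr.coef_, coef_manual))            # True
print(np.allclose(lr.intercept_, intercept_manual))  # True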

