Difference between Linear Regression Coefficients between Python and R

Question

I'm trying to run a linear regression in Python that I have already done in R in order to find variables with 0 coefficients. The issue I'm running into is that the linear regression in R returns NAs for columns with low variance while the scikit-learn regression returns the coefficients. In the R code, I find and save these variables by saving the variables with NAs as output from the linear regression, but I can't seem to figure out a way to mimic this behavior in Python. The code I'm using can be found below.

R code:

a <- c(23, 45, 546, 42, 68, 15, 47)
b <- c(1, 2, 4, 6, 34, 2, 8)
c <- c(22, 33, 44, 55, 66, 77, 88)
d <- c(1, 1, 1, 1, 1, 1, 1)
e <- c(1, 1, 1, 1, 1, 1, 1.1)
f <- c(1, 1, 1, 1, 1, 1, 1.01)
g <- c(1, 1, 1, 1, 1, 1, 1.001)

df <- data.frame(a, b, c, d, e, f, g)
var_list <- c('b', 'c', 'd', 'e', 'f', 'g')

target <- df$a
reg_data <- cbind(target, df[, var_list])


if (nrow(reg_data) < length(var_list)){
  message(paste0('    WARNING: Data set is rank deficient. Result may be doubtful'))
}
reg_model <- lm(target ~ ., data = reg_data)

print(reg_model$coefficients)

# store the independent variables whose coefficients came back as NA
zero_coef_IndepVars.v <- names(which(is.na(reg_model$coefficients)))

print(zero_coef_IndepVars.v)

Python code:

import pandas as pd
from sklearn import linear_model

a = [23, 45, 546, 42, 68, 15, 47]
b = [1, 2, 4, 6, 34, 2, 8]
c = [22, 33, 44, 55, 66, 77, 88]
d = [1, 1, 1, 1, 1, 1, 1]
e = [1, 1, 1, 1, 1, 1, 1.1]
f = [1, 1, 1, 1, 1, 1, 1.01]
g = [1, 1, 1, 1, 1, 1, 1.001]


df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d,
                   'e': e,
                   'f': f,
                   'g': g})


var_list = ['b', 'c', 'd', 'e', 'f', 'g']

# build linear regression model and test for linear combinations
target = df['a']
reg_data = pd.DataFrame()
reg_data['a'] = target
train_cols = df.loc[:, df.columns.str.lower().isin(var_list)]


if reg_data.shape[0] < len(var_list):
    print('    WARNING: Data set is rank deficient. Result may be doubtful')

# Create linear regression object
reg_model = linear_model.LinearRegression()

# Train the model using the training sets
reg_model.fit(train_cols, reg_data['a'])

print(reg_model.coef_)

Output from R:

(Intercept)           b           c           d           e           f           g 
 537.555988   -0.669253   -1.054719          NA -356.715149          NA          NA 

> print(zero_coef_IndepVars.v)
[1] "d" "f" "g"

Python output:

           b             c   d               e              f            g
[-0.66925301   -1.05471932   0.   -353.1483504   -35.31483504   -3.5314835]

As you can see, the values for columns 'b', 'c', and 'e' are close, but very different for 'd', 'f', and 'g'. For this example regression, I would want to return ['d', 'f', 'g'] as their outputs are NA from R. The issue is that the sklearn linear regression returns 0 for col 'd', while it returns -35.31 for col 'f' and -3.531 for col 'g'.

Does anyone know how R decides whether to return NA or a value, or how to implement this behavior in the Python version? Knowing where the differences stem from will likely help me implement the R behavior in Python. I need the results of the Python script to match the R outputs exactly.

Answer

It's a difference in implementation. lm in R uses underlying C code that is based on a QR decomposition: the model matrix is decomposed into an orthogonal matrix Q and a triangular matrix R. This leads to what others have called "a check on collinearity". R doesn't explicitly check for collinearity; the nature of the QR decomposition ensures that the least collinear variables get "priority" in the fitting algorithm, and any column that turns out to be a (numerical) linear combination of the columns before it is reported with an NA coefficient.

More info on QR decomposition in the context of linear regression: https://www.stat.wisc.edu/~larget/math496/qr.html
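
To make that concrete in Python, one way to reproduce the aliasing is to walk the model-matrix columns in order and flag any column whose norm is numerically explained by the columns accepted before it, which is roughly what R's pivoting does. The sketch below is only an approximation of R's actual dqrdc2 routine; the 1e-7 tolerance mirrors lm.fit's default tol, and aliased_columns is an illustrative helper name, not a library function.

import numpy as np
import pandas as pd

def aliased_columns(X, col_names, tol=1e-7):
    # Walk the columns in order (modified Gram-Schmidt). A column whose
    # residual, after projecting out the columns accepted so far, is tiny
    # relative to its own norm is a (near-)linear combination of them --
    # these are the columns lm reports with NA coefficients.
    basis = []
    flagged = []
    for name, col in zip(col_names, X.T):
        v = col.astype(float).copy()
        for q in basis:
            v -= (q @ v) * q
        if np.linalg.norm(v) <= tol * np.linalg.norm(col):
            flagged.append(name)
        else:
            basis.append(v / np.linalg.norm(v))
    return flagged

# Same toy data as in the question, with an explicit intercept column
# because lm(target ~ .) adds one
df = pd.DataFrame({'a': [23, 45, 546, 42, 68, 15, 47],
                   'b': [1, 2, 4, 6, 34, 2, 8],
                   'c': [22, 33, 44, 55, 66, 77, 88],
                   'd': [1, 1, 1, 1, 1, 1, 1],
                   'e': [1, 1, 1, 1, 1, 1, 1.1],
                   'f': [1, 1, 1, 1, 1, 1, 1.01],
                   'g': [1, 1, 1, 1, 1, 1, 1.001]})
var_list = ['b', 'c', 'd', 'e', 'f', 'g']
X = np.column_stack([np.ones(len(df))] + [df[v] for v in var_list])
names = ['(Intercept)'] + var_list

print(aliased_columns(X, names))   # ['d', 'f', 'g'], matching R's NA set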

The code from sklearn is basically a wrapper around numpy.linalg.lstsq, which minimizes the squared Euclidean norm: if your model is Y = AX, it minimizes ||Y - AX||^2. This is a different (and computationally less stable) algorithm, and it doesn't have the QR decomposition's side effect of exposing the aliased columns.
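
A quick way to see that behaviour is to call the least-squares routine directly on the same rank-deficient model matrix: it reports a numerical rank of 4 and returns a single minimum-norm solution that spreads weight across the redundant columns instead of dropping them. A small illustration, assuming X, names and df from the previous snippet are still in scope; the exact numbers will differ from the scikit-learn output above because LinearRegression handles the intercept separately.

# Direct least squares on the rank-deficient X, with y taken from column 'a'
y = df['a'].to_numpy(dtype=float)
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

print(rank)                                   # 4: three of the seven columns are redundant
print(dict(zip(names, np.round(coef, 4))))    # redundant columns get values instead of NA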

Personal note: if you want robust fitting of models in a proven and tested computational framework and insist on using Python, look for linear regression implementations that are based on QR or SVD. The packages scikit-learn or statsmodels (still in beta as of 22 April 2017) should get you there.
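
For reference, a minimal statsmodels fit of the same model might look like the sketch below (assuming the df and var_list from the question are in scope). Note that OLS solves the problem with a pseudo-inverse by default, so it still returns values rather than NA for the aliased columns; the multicollinearity note in the summary is what flags the problem, and a QR-style check like the one sketched earlier is still needed if the output must match R exactly.

import statsmodels.api as sm

# has_constant='add' forces an intercept column even though column 'd' is already constant
X_sm = sm.add_constant(df[var_list], has_constant='add')
ols_fit = sm.OLS(df['a'], X_sm).fit()

print(ols_fit.params)     # coefficients for const, b, c, d, e, f, g
print(ols_fit.summary())  # typically warns about multicollinearity / a huge condition number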
