Dummy Variable in Multiple Linear Regression


Question

Why do we use one fewer dummy variable than the number of categories in a multiple linear regression model?

For example, if the model contains 4 dummy variables, we update the feature matrix before training the regression model: x = x[:, 1:4], which keeps columns 1 to 3 and drops column 0.
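To make the slicing in the question concrete, here is a minimal sketch. It assumes x is a NumPy array whose first four columns are the dummies of a single categorical feature; the sample data and the names categories, x_reduced and is_baseline are made up for illustration.

```python
import numpy as np

# Hypothetical data: one categorical feature with 4 levels (coded 0..3),
# one-hot encoded into columns 0..3 of x.
categories = np.array([0, 1, 2, 3, 1, 0])
x = np.eye(4)[categories]      # shape (6, 4), exactly one 1 per row

# Drop the first dummy column; columns 1, 2 and 3 remain.
x_reduced = x[:, 1:4]          # shape (6, 3)

# A row of all zeros in x_reduced identifies the dropped level (level 0),
# so no information about the category is lost.
is_baseline = x_reduced.sum(axis=1) == 0
```

If x held further feature columns after the dummies, x[:, 1:4] would discard those too; x[:, 1:] would keep them.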

Recommended answer

When including dummy variables in a regression model, one should be careful of the dummy variable trap. The dummy variable trap is a scenario in which the independent variables are multicollinear, that is, two or more variables are highly correlated; in simple terms, one variable can be predicted from the others.

Let's say you have a simple categorical variable such as gender, with categories «male» and «female». You get two dummy variables, «male» and «female», each of which can be either true or false. One of them is redundant, because each can be predicted exactly from the other.

In another example: when you have four categories A/B/C/D, you get four dummy variables. If you know that the class is not A, B or C, you know it must be D. Therefore you can, and should, drop one dummy variable.
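The A/B/C/D argument can be sketched in plain Python. The helper names (encode_full, encode_reduced, decode_reduced) and the choice of D as the dropped category are illustrative assumptions, not part of the original answer.

```python
LEVELS = ["A", "B", "C", "D"]

def encode_full(value):
    """One dummy per level: four 0/1 indicators."""
    return [int(value == level) for level in LEVELS]

def encode_reduced(value):
    """Keep only the dummies for A, B and C; D becomes the reference level."""
    return encode_full(value)[:-1]

def decode_reduced(dummies):
    """Recover the level from three dummies; all zeros means D."""
    if 1 in dummies:
        return LEVELS[dummies.index(1)]
    return LEVELS[-1]

samples = ["A", "D", "C", "B", "D"]
reduced = [encode_reduced(s) for s in samples]
recovered = [decode_reduced(r) for r in reduced]

# The dropped fourth dummy is always 1 minus the sum of the other three.
fourth = [1 - sum(r) for r in reduced]
```

Here recovered equals samples, which shows that three dummies carry the same information as four.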

Technically, the dummy variable trap is a scenario in which the independent variables are multicollinear: two or more variables are highly correlated. With a full set of dummies plus an intercept the dependence is exact, because the dummies of one categorical variable always sum to 1, which is the intercept column. This leads to problems in your regression algorithm:

In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data.
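One way to see the problem is to compare the rank of the design matrix with and without the extra dummy. This is a minimal sketch assuming the model includes an intercept column; the data and the names X_trap and X_ok are made up for illustration.

```python
import numpy as np

# Hypothetical categorical feature with 4 levels, coded 0..3.
levels = np.array([0, 1, 2, 3, 0, 2, 1, 3])
dummies = np.eye(4)[levels]                 # all four dummy columns
intercept = np.ones((len(levels), 1))

# Intercept plus all 4 dummies: 5 columns, but the dummies sum to the
# intercept column, so the columns are linearly dependent.
X_trap = np.hstack([intercept, dummies])

# Intercept plus 3 dummies (first level dropped): 4 independent columns.
X_ok = np.hstack([intercept, dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
```

X_trap has five columns but rank four, so ordinary least squares has no unique solution for its coefficients. X_ok has full column rank.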

Bottom line: when modelling a categorical variable with N possible values in a model that has an intercept, you should use N−1 dummy variables.
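As a rough illustration of the N−1 rule, the sketch below fits an ordinary least-squares model with an intercept and three dummies for a four-level variable. The data are invented so that each level has a fixed response (10, 12, 15, 11), and np.linalg.lstsq stands in for whatever regression routine is actually used.

```python
import numpy as np

# Hypothetical data: 4 levels (coded 0..3), response equal to a per-level mean.
levels = np.array([0, 1, 2, 3, 0, 1, 2, 3])
group_means = np.array([10.0, 12.0, 15.0, 11.0])
y = group_means[levels]

# Design matrix: intercept plus N-1 = 3 dummies (level 0 is the reference).
dummies = np.eye(4)[levels]
X = np.hstack([np.ones((len(levels), 1)), dummies[:, 1:]])

# Ordinary least squares; coef[0] is the intercept.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With these data the intercept comes out as the reference level's mean (10), and each dummy coefficient is that level's difference from the reference (2, 5 and 1), so the dropped category is still represented in the model.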
