R - Linear Regression - Control for a variable

Question

I have a computer science background and I am trying to teach myself data science by solving problems available on the internet.

I have a smallish data set with 3 variables: race, gender and annual income. There are about 10,000 sample observations. I am trying to predict income from race and gender.

I have divided the data into 2 parts, one for each gender, and now I am trying to create 2 regression models. Is this possible in R? Can someone provide example syntax?

Recommended answer

You don't specify how your data are stored or how the variable race is recorded (is it a factor?).

[If you're just fitting income against race for males, say, and you have the male income and race in income.m and race.m, and race.m is a factor in R, then lm(income.m~race.m) will fit the line for males (use summary on the resulting object to get information about it). You could do something similar for females. But most people won't fit the models this way.]
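A minimal sketch of the two separate fits, assuming the data sit in a data frame rather than in separate vectors. The data frame incdata here is simulated with made-up values purely so the code runs; the level names ("A", "B", "C", "F", "M") and the numbers are assumptions, not the asker's data.

```r
# Simulated stand-in data; every value here is invented for illustration.
set.seed(1)
n <- 200
incdata <- data.frame(
  race   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
incdata$income <- 30000 + 2000 * (incdata$race == "B") +
  5000 * (incdata$gender == "M") + rnorm(n, sd = 4000)

# One model per gender, each fitted only on the matching rows.
fit_m <- lm(income ~ race, data = subset(incdata, gender == "M"))
fit_f <- lm(income ~ race, data = subset(incdata, gender == "F"))

summary(fit_m)  # full coefficient table for the male subset
coef(fit_f)     # just the estimates for the female subset
```

Using subset() inside the call avoids having to keep separate income.m and race.m vectors around.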

If you're prepared to assume that the variation about the lines is the same for both genders, you can fit both lines with one model.

This has several advantages over analyzing the lines separately, though separate analyses can also be done.

If gender is either a factor or a numeric variable recorded as 0/1, race is a factor, and you have the data in a data frame (called, for example, incdata), then you'd fit both lines at once with:

lm(income~race*gender, data=incdata)

which is R shorthand for

lm(income~race+gender+race:gender, data=incdata)

where race:gender is an interaction term.
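To check that the two formulas describe the same model, both can be fitted and their coefficients compared. This is a sketch on simulated data: incdata below is invented (its level names and effect sizes are assumptions), so only the equivalence of the two fits is meaningful, not the estimates themselves.

```r
# Simulated stand-in data; every value here is invented for illustration.
set.seed(2)
n <- 300
incdata <- data.frame(
  race   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
incdata$income <- 30000 + 2000 * (incdata$race == "B") +
  5000 * (incdata$gender == "M") + rnorm(n, sd = 4000)

fit_star <- lm(income ~ race * gender, data = incdata)
fit_long <- lm(income ~ race + gender + race:gender, data = incdata)

# Same terms, same coefficient names, same estimates.
all.equal(coef(fit_star), coef(fit_long))
names(coef(fit_star))  # main effects first, then the race:gender terms
```

The interaction coefficients measure how much the race differences for one gender depart from those for the reference gender.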

If you further assume that the effect of race is the same for both sexes (that is, there is no interaction), then the smaller model:

lm(income~race+gender, data=incdata)

would be used instead. This is often the model people fit when asked to 'control for gender', though many would consider the interaction model mentioned earlier instead.
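One common way to choose between the additive model and the interaction model is an F test on the nested pair. The sketch below uses simulated data (incdata, its level names and its effect sizes are all assumptions made so the code runs), so the resulting p-value says nothing about real incomes.

```r
# Simulated stand-in data; every value here is invented for illustration.
set.seed(3)
n <- 300
incdata <- data.frame(
  race   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
incdata$income <- 30000 + 2000 * (incdata$race == "B") +
  5000 * (incdata$gender == "M") + rnorm(n, sd = 4000)

fit_add <- lm(income ~ race + gender, data = incdata)  # "control for gender"
fit_int <- lm(income ~ race * gender, data = incdata)  # adds race:gender

# F test of the extra interaction terms in the larger model.
cmp <- anova(fit_add, fit_int)
cmp

# In the additive model, the race coefficients are race differences
# at a fixed gender.
coef(fit_add)
```

A large p-value in the second row of the table would give little reason to prefer the interaction model over the additive one.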

I'd strongly advise working on simpler regression problems first, with a textbook or set of notes suitable for guiding you through the ideas.

If you haven't already fitted a regression in R, I'd start with a smaller data set that has only a single predictor, just to get used to the basic mechanics.

R comes with many data sets already built in. See, for example, library(help=datasets), which lists about 80 data sets; some of the packages that come with R have more (MASS has over 80, for example). Many R packages on CRAN also include data sets, many of them suitable for regression.
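As a small sketch, the same listing can be pulled into an object with data(), which makes it easy to count or search the entries. The variable name ds is arbitrary, and the exact count depends on the R version installed.

```r
# data() returns a listing object; $results is a character matrix with
# columns "Package", "LibPath", "Item" and "Title".
ds <- data(package = "datasets")$results

nrow(ds)                      # number of entries in this R installation
head(ds[, c("Item", "Title")])  # first few names with their descriptions
"cars" %in% ds[, "Item"]      # check that a particular data set is listed
```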

For example, the cars data set (see ?cars in R) records the stopping distance of cars, given their speed. You don't need to read the data in; it's already there.

A simple linear regression (not necessarily the best model given some understanding of physics, but just about adequate for the data) would be:

lm(dist~speed, cars)

Again, use summary to examine it, e.g. (I suggest you type these one at a time):

carsfit <- lm(dist~speed, cars)
summary(carsfit)
plot(dist~speed, cars)
abline(carsfit, col=2)
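Beyond summary, a fitted model can also be queried directly. The sketch below goes a little past the original answer: it pulls out the coefficients and asks for predictions at two speeds chosen arbitrarily for illustration (new_speeds is a hypothetical name).

```r
# Fit stopping distance (ft) against speed (mph) on the built-in cars data.
carsfit <- lm(dist ~ speed, data = cars)

coef(carsfit)     # intercept and slope of the fitted line
confint(carsfit)  # 95% confidence intervals for both coefficients

# Predicted stopping distance at two example speeds, with prediction intervals.
new_speeds <- data.frame(speed = c(10, 20))
pred <- predict(carsfit, newdata = new_speeds, interval = "prediction")
pred
```

The column name in new_speeds has to match the predictor name used in the formula, otherwise predict cannot find it.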

The examples in the help for the cars data set (?cars) give several other models and plots. You might try those one at a time as well.

The car package (CAR is short for "Companion to Applied Regression") has many small data sets specifically for regression.
