Weighted linear regression with Scikit-learn


Problem description


My data:

State           N           Var1            Var2
Alabama         23          54              42
Alaska          4           53              53
Arizona         53          75              65

Var1 and Var2 are aggregated percentage values at the state level. N is the number of participants in each state. I would like to run a linear regression between Var1 and Var2, using N as the weight, with sklearn in Python 2.7.

The general line is:

fit(X, y[, sample_weight])

Say the data is loaded into df using Pandas and the N becomes df["N"], do I simply fit the data into the following line or do I need to process the N somehow before using it as sample_weight in the command?

fit(df["Var1"], df["Var2"], sample_weight=df["N"])

Solution

The weights enable training a model that is more accurate for certain values of the input (e.g., where the cost of error is higher). Internally, the weights w are multiplied by the squared residuals in the loss function [1]:

    minimize over β:   Σᵢ wᵢ (yᵢ − Xᵢβ)²

Therefore, it is the relative scale of the weights that matters. N can be passed as is if it already reflects the priorities. Uniform scaling of all the weights does not change the outcome.
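A quick numeric check of this scaling-invariance claim, on hypothetical random data (the variable names and data here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(50, 1)
y = 3 * X.ravel() + rng.randn(50) * 0.1
w = rng.rand(50) * 10  # arbitrary positive weights

# Fit once with the raw weights, once with uniformly scaled weights
coef_a = LinearRegression().fit(X, y, sample_weight=w).coef_
coef_b = LinearRegression().fit(X, y, sample_weight=w * 1000).coef_

print(np.allclose(coef_a, coef_b))  # the coefficients are unchanged
```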

Here is an example. In the weighted version, we emphasize the region around the last two samples, and the model becomes more accurate there. And scaling does not affect the outcome, as expected.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the diabetes dataset
X, y = datasets.load_diabetes(return_X_y=True)
n_samples = 20

# Use only one feature and sort
X = X[:, np.newaxis, 2][:n_samples]
y = y[:n_samples]
p = X.argsort(axis=0)
X = X[p].reshape((n_samples, 1))
y = y[p]

# Create equal weights and then augment the last 2 ones
sample_weight = np.ones(n_samples) * 20
sample_weight[-2:] *= 30

plt.scatter(X, y, s=sample_weight, c='grey', edgecolor='black')

# The unweighted model
regr = LinearRegression()
regr.fit(X, y)
plt.plot(X, regr.predict(X), color='blue', linewidth=3, label='Unweighted model')

# The weighted model
regr = LinearRegression()
regr.fit(X, y, sample_weight)
plt.plot(X, regr.predict(X), color='red', linewidth=3, label='Weighted model')

# The weighted model - scaled weights
regr = LinearRegression()
sample_weight = sample_weight / sample_weight.max()
regr.fit(X, y, sample_weight)
plt.plot(X, regr.predict(X), color='yellow', linewidth=2, label='Weighted model - scaled', linestyle='dashed')
plt.xticks(())
plt.yticks(())
plt.legend()
plt.show()

(this reshaping into a 2-D array also seems necessary when passing Var1 and Var2 to fit, since scikit-learn expects a 2-D X)
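Applied to the question's own data, the fit could look like the sketch below (the DataFrame is constructed from the table in the question; selecting the column with double brackets, `df[["Var1"]]`, keeps it 2-D as fit expects, and N is passed as sample_weight without any preprocessing):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Data from the question's table
df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "N":     [23, 4, 53],
    "Var1":  [54, 53, 75],
    "Var2":  [42, 53, 65],
})

# df[["Var1"]] is a (3, 1) DataFrame; df["Var1"] alone would be 1-D
X = df[["Var1"]]
y = df["Var2"]

regr = LinearRegression()
regr.fit(X, y, sample_weight=df["N"])  # N passed as-is
print(regr.coef_, regr.intercept_)
```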
