Assigning coefficient vector back to features in scikit learn Lasso


Question

I am running a Lasso in scikit learn on a dataset. Here is what my design matrix (X) looks like:

    Year    Country SW  NY.GDP.DEFL.KD.ZG.1 NY.GDP.PCAP.KD.ZG   NY.GDP.DEFL.KD.ZG   NE.IMP.GNFS.ZS  NY.GDP.DISC.CN  FS.AST.PRVT.GD.ZS   FS.AST.DOMS.GD.ZS   NY.GDS.TOTL.ZS  NY.GDP.DISC.KN  NY.GDP.NGAS.RT.ZS   NY.GDP.PETR.RT.ZS   NY.GDP.COAL.RT.ZS   NY.GDP.MINR.RT.ZS   NY.GDP.TOTL.RT.ZS   MS.MIL.XPND.GD.ZS
0   0   0   1   -3576217.383052 -5146876.546040 -3471506.772186 -2633821.885258 -3.680928e+06   91.575314   99.278420   -5670429.600369 -3.785639e+06   -4832744.713442 -5461008.378638 -3366796.16132  -3995059.826515 -5565718.989504 -1691426.387465
1   1   0   1   5.713486    0.563529    4.713486    21.969161   -5.000000e+06   88.625556   92.244479   23.625253   1.309500e+10    1.089173    0.983267    0.00000 1.471053    3.860570    2.057921
2   2   0   1   3.559686    2.640931    2.559686    21.466621   -1.000000e+06   87.785550   93.413707   24.273287   1.558700e+10    1.014641    1.021970    0.00000 1.371797    3.681716    1.925137
3   3   0   1   1.337874    3.811404    0.337874    20.646004   1.000000e+06    84.262083   91.313310   23.840716   1.962200e+10    0.445549    0.412880    0.00000 1.079369    2.178213    1.994438
4   0   1   1   7.638720    9.914861    6.638720    25.640006   -1.305679e+11   129.923249  146.277785  51.979295   -6.818467e+11   0.164374    1.500932    2.37375 2.563449    6.954085    2.079635

It has three categorical features at the beginning.

Here is what my target vector (Y) looks like:

0   -0.003094
1   -0.015327
2    0.100617
3    0.067728
4    0.089962

Both are currently a pandas DataFrame/Series.

Now I recode my categorical variables in X using scikit's OneHotEncoder:

from sklearn import preprocessing
X_train = preprocessing.OneHotEncoder(categorical_features=[0, 1, 2], sparse=False).fit_transform(data_train)

This transforms the data to something like this:

X_train[0:2]
Out[473]:
array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
         -3.57621738e+06,  -5.14687655e+06,  -3.47150677e+06,
         -2.63382189e+06,  -3.68092799e+06,   9.15753144e+01,
          9.92784200e+01,  -5.67042960e+06,  -3.78563860e+06,
         -4.83274471e+06,  -5.46100838e+06,  -3.36679616e+06,
         -3.99505983e+06,  -5.56571899e+06,  -1.69142639e+06],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          5.71348642e+00,   5.63529053e-01,   4.71348642e+00,
          2.19691610e+01,  -5.00000000e+06,   8.86255560e+01,
          9.22444788e+01,   2.36252526e+01,   1.30950000e+10,
          1.08917343e+00,   9.83266854e-01,   0.00000000e+00,
          1.47105308e+00,   3.86057046e+00,   2.05792067e+00]])

After this I do missing-value imputation:

X_imputed=preprocessing.Imputer().fit_transform(X_train) 
X_imputed[0:1]
Out[474]:
array([[       1.        ,        0.        ,        0.        ,
               0.        ,        1.        ,        0.        ,
               0.        ,        0.        ,        0.        ,
               0.        ,        0.        ,        0.        ,
               0.        ,        0.        ,        1.        ,
        -3576217.38305151, -5146876.54603993, -3471506.77218561,
        -2633821.88525845, -3680927.9939174 ,       91.57531444,
              99.27842   , -5670429.60036941, -3785638.6047833 ,
        -4832744.71344225, -5461008.37863762, -3366796.16131972,
        -3995059.82651509, -5565718.98950351, -1691426.3874654 ]])
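(Note that Imputer's default strategy fills each column with its mean, including the one-hot dummy columns. A minimal sketch of restricting imputation to the numeric block instead, assuming the dummy columns come first and contain no missing values; a small fabricated array stands in for the real X_train:)

```python
import numpy as np

# Fabricated stand-in for X_train: 2 dummy columns, then numeric columns.
X_train = np.array([[1.0, 0.0, 2.5],
                    [0.0, 1.0, np.nan],
                    [1.0, 0.0, 4.5]])
n_dummies = 2

# Mean-impute only the numeric block, leaving the dummies untouched
# (equivalent to Imputer's default strategy='mean' on those columns).
numeric = X_train[:, n_dummies:]
col_means = np.nanmean(numeric, axis=0)
numeric_filled = np.where(np.isnan(numeric), col_means, numeric)
X_imputed = np.hstack([X_train[:, :n_dummies], numeric_filled])
```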

By now I have started getting confused about the order of the variables, because after using OneHotEncoder my data frame is converted to a numpy array and the headers are stripped. So I am not sure what the first 13 columns (the dummies for the three categoricals) are, or in which order they appear.

Secondly, I go ahead and run LassoCV to get the right Lasso alpha value and the corresponding coefficients.

from sklearn import linear_model 
lasso=linear_model.LassoCV(max_iter=2000,cv=10,normalize=True)
lasso.fit(X_imputed,Y_train)

When I check which alpha value it finally chose using cross-validation, it gives this:

lasso.alpha_
Out[476]:
4.1303618102099771e-05

So I am assuming this alpha value is the best one, i.e. the one that gives the least MSE over all 10 folds.

But now when I try to find the lasso path for all the alphas it tried, here is what I get. I am creating a numpy array to store the MSE of all 10 folds for each alpha chosen by lasso (100 alphas by 10 folds):

scores=np.zeros((100,2))
scores[:,0]=lasso.mse_path_[:,0]
scores[:,1]=np.mean(lasso.mse_path_[:,1:],axis=1)
scr=scores[scores[:,1].argsort()]

Since I have sorted my scores matrix in ascending order of MSE per alpha, I expect the first record to show me the alpha for which the score is minimal.

scr[0]
Out[477]:
array([ 441334.91133953,       0.00739538])

But I see an alpha value totally different from what I got in the step above using lasso.alpha_. That one was on the order of 10^-5 and this one is on the order of 10^+5. Why is that?

Thirdly, here is my coefficient vector from the Lasso. How do I know which coefficient is mapped to which feature in my original data set (data_train)? This is what I ultimately need: the weight corresponding to each feature at the best chosen alpha.

lasso.coef_
Out[478]:
array([ 0.02930289,  0.01039652, -0.        , -0.05448752,  0.01310975,
        0.        , -0.03755883,  0.02754805, -0.0498908 , -0.10531218,
       -0.08303772,  0.00465392,  0.        , -0.04597282,  0.        ,
        0.00000003,  0.        ,  0.        ,  0.        ,  0.        ,
       -0.00101291,  0.00155892,  0.        ,  0.        ,  0.        ,
        0.        , 

Right now, because the headers have been stripped, I have no clue which weights correspond to which features. Also, why is the alpha value different when I use lasso.alpha_ versus when I inspect lasso.mse_path_ and check for the lowest MSE?

Any ideas?

Answer

To relate the feature indices back to the original feature columns, you can use the feature_indices_ attribute of OneHotEncoder after fitting:

from sklearn import preprocessing
encoder = preprocessing.OneHotEncoder(categorical_features=[0, 1, 2])
X_train = encoder.fit_transform(data_train)
print(encoder.feature_indices_)

Output:

[0 4 6 8]

From the documentation:

feature_indices_ : array of shape (n_features,). Indices to feature ranges. Feature i in the original data is mapped to features from feature_indices_[i] to feature_indices_[i+1] (and then potentially masked by active_features_ afterwards).

In this case, the first 4 dimensions in the one-hot encoded space correspond to the column Year, the next 2 correspond to the column Country, and the last 2 correspond to SW.
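Building on this, one way to pair each coefficient with a readable name is to expand the categorical ranges given by feature_indices_ and then append the remaining numeric column names in their original order. A sketch, with feature_indices_ and the column names hard-coded from the example above (real code would take them from the fitted encoder and data_train.columns; also note active_features_ may drop unseen levels, shifting the dummy columns):

```python
import numpy as np

# Hard-coded from the example: feature_indices_ printed as [0 4 6 8].
feature_indices = np.array([0, 4, 6, 8])
categorical_names = ["Year", "Country", "SW"]
# First two numeric columns as illustration; the real list is data_train.columns[3:].
numeric_names = ["NY.GDP.DEFL.KD.ZG.1", "NY.GDP.PCAP.KD.ZG"]

# Expand each categorical into one name per dummy column (level labels are
# hypothetical integer codes here, not the original category values).
dummy_names = []
for name, start, stop in zip(categorical_names, feature_indices[:-1], feature_indices[1:]):
    dummy_names += ["%s=%d" % (name, level) for level in range(stop - start)]

# Encoded categoricals come first in the transformed matrix, then the
# passthrough numeric columns in their original order.
all_names = dummy_names + numeric_names

# Pair each name with its weight, e.g.:
# for name, w in zip(all_names, lasso.coef_):
#     print(name, w)
```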

