在Python中建模时检测mulicollinear或具有线性组合的列:LinAlgError [英] Detecting mulicollinear , or columns that have linear combinations while modelling in Python : LinAlgError

查看：294 发布时间：2020/5/4 3:16:42 python-2.7 numpy statsmodels logistic-regression singular

本文介绍了在Python中建模时检测mulicollinear或具有线性组合的列:LinAlgError的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在为具有34个因变量的logit模型建模数据，并且不断抛出奇异矩阵错误，如下所示:-

I am modelling data for a logit model with 34 dependent variables,and it keep throwing in the singular matrix error , as below -:

Traceback (most recent call last):
  File "<pyshell#1116>", line 1, in <module>
    test_scores  = smf.Logit(m['event'], train_cols,missing='drop').fit()
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 1186, in fit
    disp=disp, callback=callback, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 164, in fit
    disp=disp, callback=callback, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 357, in fit
    hess=hess)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 405, in _fit_mle_newton
    newparams = oldparams - np.dot(np.linalg.inv(H),
  File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 445, in inv
    return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
  File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 328, in solve
    raise LinAlgError, 'Singular matrix'
LinAlgError: Singular matrix

我是在什么时候才采用这种方法将矩阵简化为独立的列

Which was when I stumpled on this method to reduce the matrix to its independent columns

def independent_columns(A, tol = 0):#1e-05):
    """
    Return an array composed of independent columns of A.

    Note the answer may not be unique; this function returns one of many
    possible answers.

    https://stackoverflow.com/q/13312498/190597 (user1812712)
    http://math.stackexchange.com/a/199132/1140 (Gerry Myerson)
    http://mail.scipy.org/pipermail/numpy-discussion/2008-November/038705.html
        (Anne Archibald)

    >>> A = np.array([(2,4,1,3),(-1,-2,1,0),(0,0,2,2),(3,6,2,5)])
    2 4 1 3
    -1 -2 1 0
    0 0 2 2
    3 6 2 5
    # try with checking the rank of matrixs 
    >>> independent_columns(A)
    np.array([[1, 4],
              [2, 5],
              [3, 6]])
    """
    Q, R = linalg.qr(A)
    independent = np.where(np.abs(R.diagonal()) > tol)[0]
    #print independent
    return A[:, independent], independent


A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None)) 
#train_cols will not be converted back from a df to a  matrix object,so doing this explicitly
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])

test_scores = smf.Logit(m['event'],A2,missing='drop').fit()

我仍然收到LinAlgError，尽管我希望现在矩阵级别降低.

I still get the LinAlgError , though I was hoping I will have the reduced matrix rank now.

此外，我看到np.linalg.matrix_rank(train_cols)返回33(即，在调用Independent_columns函数之前，总"x"列为34(即，len(train_cols.ix[0])=34)，这意味着我没有完整的排名矩阵)，而np.linalg.matrix_rank(A2)返回33(表示我已经删除了列，但仍然看到LinAlgError，当我运行test_scores = smf.Logit(m['event'],A2,missing='drop').fit()时，我缺少了什么?

Also, I see np.linalg.matrix_rank(train_cols) returns 33 (ie. before calling on the independent_columns function total "x" columns was 34(ie, len(train_cols.ix[0])=34 ), meaning I don't have a full rank matrix), while np.linalg.matrix_rank(A2) returns 33 (meaning I have dropped a columns, and yet I still see the LinAlgError , when I run test_scores = smf.Logit(m['event'],A2,missing='drop').fit() , what am I missing ?

参考上面的代码- 如何查找退化的协方差矩阵中的行/列

我试图通过一次引入每个变量来开始建立模型，这不会给我单一的矩阵误差，但是我宁愿有一个确定性的方法，让我知道我是什么.做错了如何消除这些列.

I tried to start building the model forward,by introducing each variable at a time, which doesn't give me the singular matrix error, but I would rather have a method that is deterministic, and lets me know, what am I doing wrong & how to eliminate these columns.

编辑(已通过@发布更新建议下面的user333700)

Edit (updated post the suggestions by @ user333700 below)

1..您是对的，"A2"的排名没有降低到33. IE. len(A2.ix[0]) =34->表示可能的共线列未删除-我是否应将"tol"(获得A2的等级(及其列数)的公差)增加为33.如果将tol更改为"1e-05" "，那么我确实得到了len(A2.ix[0]) =33，这向我暗示tol> 0(严格来说)是一个指标. 在此之后，我只是执行了test_scores = smf.Logit(m['event'],A2,missing='drop').fit()，而没有进行nm收敛.

1. You are right, "A2" doesn't have the reduced rank of 33 . ie. len(A2.ix[0]) =34 -> meaning the possibly collinear columns are not dropped - should I increase the "tol", tolerance to get rank of A2 (and the numbers of columns thereof) , as 33. If I change the tol to "1e-05" above, then I do get len(A2.ix[0]) =33, which suggests to me that tol >0 (strictly) is one indicator. After this I just did the same, test_scores = smf.Logit(m['event'],A2,missing='drop').fit(), without nm to get the convergence.

2..尝试"nm"方法后出错.不过，奇怪的是，如果我只进行20,000行，我确实会得到结果.由于它没有显示出内存错误，而是"Inverting hessian failed, no bse or cov_params available"-我假设，有多个几乎相似的记录-您会说什么?

2. Errors post trying 'nm' method. Strange thing though is that if I take just 20,000 rows, I do get the results. Since it is not showing up Memory error, but "Inverting hessian failed, no bse or cov_params available" - I am assuming, there are multiple nearly-similar records - what would you say ?

m  = smf.Logit(data['event_custom'].ix[0:1000000] , train_cols.ix[0:1000000],missing='drop')
test_scores=m.fit(start_params=None,method='nm',maxiter=200,full_output=1)
Warning: Maximum number of iterations has been exceeded

Warning (from warnings module):
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 374
    warn(warndoc, Warning)
Warning: Inverting hessian failed, no bse or cov_params available


test_scores.summary()

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    test_scores.summary()
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2396, in summary
    yname_list)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2253, in summary
    use_t=False)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 826, in add_table_params
    use_t=use_t)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 447, in summary_params
    std_err = results.bse
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/tools/decorators.py", line 95, in __get__
    _cachedval = self.fget(obj)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1037, in bse
    return np.sqrt(np.diag(self.cov_params()))
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1102, in cov_params
    raise ValueError('need covariance of parameters for computing '
ValueError: need covariance of parameters for computing (unnormalized) covariances

(已通过以下@ user333700发布了建议)

Edit 2: (updated post the suggestions by @user333700 below)

重申我要建模的内容-不到总数的1％用户转化"(成功的结果)-因此，我采取了均衡的 35(+ ve)/65(-ve)

Reiterating what I am trying to model - less than about 1% of total users "convert" (success outcomes) - so I took a balanced sample of 35(+ve) /65 (-ve)

我怀疑该模型虽然收敛，但并不健壮.因此，将使用"start_params"作为来自不同数据集的早期迭代的参数. 此编辑是关于确认是否可以将"start_params"输入到结果中，如下所示:

I suspect the model is not robust, though it converges. So, will use "start_params" as the params from earlier iteration, from a different dataset. This edit is about confirming is the "start_params" can feed into the results as below -:

A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
m  = smf.Logit(data['event_custom'], A2,missing='drop')
#m  = smf.Logit(data['event_custom'], train_cols,missing='drop')#,method='nm').fit()#This doesnt work, so tried 'nm' which work, but used lasso, as nm did not converge.
test_scores=m.fit_regularized(start_params=None, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)

a_good_looking_previous_result.params=test_scores.params #storing the parameters of pass1 to feed into pass2

test_scores.params
bidfloor_Quartile_modified_binned_0               0.305765
connectiontype_binned_0                          -0.436798
day_custom_binned_Fri                            -0.040269
day_custom_binned_Mon                             0.138599
day_custom_binned_Sat                            -0.319997
day_custom_binned_Sun                            -0.236507
day_custom_binned_Thu                            -0.058922
user_agent_device_family_binned_iPad            -10.793270
user_agent_device_family_binned_iPhone           -8.483099
user_agent_masterclass_binned_apple               9.038889
user_agent_masterclass_binned_generic            -0.760297
user_agent_masterclass_binned_samsung            -0.063522
log_height_width                                  0.593199
log_height_width_ScreenResolution                -0.520836
productivity                                     -1.495373
games                                             0.706340
entertainment                                    -1.806886
IAB24                                             2.531467
IAB17                                             0.650327
IAB14                                             0.414031
utilities                                         9.968253
IAB1                                              1.850786
social_networking                                -2.814148
IAB3                                             -9.230780
music                                             0.019584
IAB9                                             -0.415559
C(time_day_modified)[(6, 12]]:C(country)[AUS]    -0.103003
C(time_day_modified)[(0, 6]]:C(country)[HKG]      0.769272
C(time_day_modified)[(6, 12]]:C(country)[HKG]     0.406882
C(time_day_modified)[(0, 6]]:C(country)[IDN]      0.073306
C(time_day_modified)[(6, 12]]:C(country)[IDN]    -0.207568
C(time_day_modified)[(0, 6]]:C(country)[IND]      0.033370
... more params here

现在在不同的数据集(pass2，用于索引)上，我的模型如下所示: IE.我读取了一个新的数据框，进行了所有变量转换，然后通过Logit进行了建模(如前所述).

Now on a different dataset(pass2, for indexing), I model the same as below -: ie. I read a new dataframe, do all the variable transformation and then model via Logit as earlier .

m_pass2  = smf.Logit(data['event_custom'], A2_pass2,missing='drop')
test_scores_pass2=m_pass2.fit_regularized(start_params=a_good_looking_previous_result.params, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)

，并可能通过从较早的遍历中获取"start_params"来保持迭代.

and, possibly keep iterating by picking up "start_params" from earlier passes.

在Python中建模时检测mulicollinear或具有线性组合的列:LinAlgError [英] Detecting mulicollinear , or columns that have linear combinations while modelling in Python : LinAlgError

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

在Python中建模时检测mulicollinear或具有线性组合的列:LinAlgError [英] Detecting mulicollinear , or columns that have linear combinations while modelling in Python : LinAlgError

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭