在Python中建模时检测mulicollinear或具有线性组合的列:LinAlgError [英] Detecting mulicollinear , or columns that have linear combinations while modelling in Python : LinAlgError

查看:294
本文介绍了在Python中建模时检测mulicollinear或具有线性组合的列:LinAlgError的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在为具有34个因变量的logit模型建模数据,并且不断抛出奇异矩阵错误,如下所示:-

I am modelling data for a logit model with 34 dependent variables,and it keep throwing in the singular matrix error , as below -:

Traceback (most recent call last):
  File "<pyshell#1116>", line 1, in <module>
    test_scores  = smf.Logit(m['event'], train_cols,missing='drop').fit()
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 1186, in fit
    disp=disp, callback=callback, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 164, in fit
    disp=disp, callback=callback, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 357, in fit
    hess=hess)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 405, in _fit_mle_newton
    newparams = oldparams - np.dot(np.linalg.inv(H),
  File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 445, in inv
    return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
  File "/usr/local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 328, in solve
    raise LinAlgError, 'Singular matrix'
LinAlgError: Singular matrix

我是在什么时候才采用这种方法将矩阵简化为独立的列

Which was when I stumpled on this method to reduce the matrix to its independent columns

def independent_columns(A, tol = 0):#1e-05):
    """
    Return an array composed of independent columns of A.

    Note the answer may not be unique; this function returns one of many
    possible answers.

    https://stackoverflow.com/q/13312498/190597 (user1812712)
    http://math.stackexchange.com/a/199132/1140 (Gerry Myerson)
    http://mail.scipy.org/pipermail/numpy-discussion/2008-November/038705.html
        (Anne Archibald)

    >>> A = np.array([(2,4,1,3),(-1,-2,1,0),(0,0,2,2),(3,6,2,5)])
    2 4 1 3
    -1 -2 1 0
    0 0 2 2
    3 6 2 5
    # try with checking the rank of matrixs 
    >>> independent_columns(A)
    np.array([[1, 4],
              [2, 5],
              [3, 6]])
    """
    Q, R = linalg.qr(A)
    independent = np.where(np.abs(R.diagonal()) > tol)[0]
    #print independent
    return A[:, independent], independent


A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None)) 
#train_cols will not be converted back from a df to a  matrix object,so doing this explicitly
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])

test_scores = smf.Logit(m['event'],A2,missing='drop').fit()

我仍然收到LinAlgError,尽管我希望现在矩阵级别降低.

I still get the LinAlgError , though I was hoping I will have the reduced matrix rank now.

此外,我看到np.linalg.matrix_rank(train_cols)返回33(即,在调用Independent_columns函数之前,总"x"列为34(即,len(train_cols.ix[0])=34),这意味着我没有完整的排名矩阵),而np.linalg.matrix_rank(A2)返回33(表示我已经删除了列,但仍然看到LinAlgError,当我运行test_scores = smf.Logit(m['event'],A2,missing='drop').fit()时,我缺少了什么?

Also, I see np.linalg.matrix_rank(train_cols) returns 33 (ie. before calling on the independent_columns function total "x" columns was 34(ie, len(train_cols.ix[0])=34 ), meaning I don't have a full rank matrix), while np.linalg.matrix_rank(A2) returns 33 (meaning I have dropped a columns, and yet I still see the LinAlgError , when I run test_scores = smf.Logit(m['event'],A2,missing='drop').fit() , what am I missing ?

参考上面的代码- 如何查找退化的协方差矩阵中的行/列

我试图通过一次引入每个变量来开始建立模型,这不会给我单一的矩阵误差,但是我宁愿有一个确定性的方法,让我知道我是什么.做错了如何消除这些列.

I tried to start building the model forward,by introducing each variable at a time, which doesn't give me the singular matrix error, but I would rather have a method that is deterministic, and lets me know, what am I doing wrong & how to eliminate these columns.

编辑(已通过@发布更新建议 下面的user333700)

Edit (updated post the suggestions by @ user333700 below)

1..您是对的,"A2"的排名没有降低到33. IE. len(A2.ix[0]) =34->表示可能的共线列未删除-我是否应将"tol"(获得A2的等级(及其列数)的公差)增加为33.如果将tol更改为"1e-05" ",那么我确实得到了len(A2.ix[0]) =33,这向我暗示tol> 0(严格来说)是一个指标. 在此之后,我只是执行了test_scores = smf.Logit(m['event'],A2,missing='drop').fit(),而没有进行nm收敛.

1. You are right, "A2" doesn't have the reduced rank of 33 . ie. len(A2.ix[0]) =34 -> meaning the possibly collinear columns are not dropped - should I increase the "tol", tolerance to get rank of A2 (and the numbers of columns thereof) , as 33. If I change the tol to "1e-05" above, then I do get len(A2.ix[0]) =33, which suggests to me that tol >0 (strictly) is one indicator. After this I just did the same, test_scores = smf.Logit(m['event'],A2,missing='drop').fit(), without nm to get the convergence.

2..尝试"nm"方法后出错.不过,奇怪的是,如果我只进行20,000行,我确实会得到结果.由于它没有显示出内存错误,而是"Inverting hessian failed, no bse or cov_params available"-我假设,有多个几乎相似的记录-您会说什么?

2. Errors post trying 'nm' method. Strange thing though is that if I take just 20,000 rows, I do get the results. Since it is not showing up Memory error, but "Inverting hessian failed, no bse or cov_params available" - I am assuming, there are multiple nearly-similar records - what would you say ?

m  = smf.Logit(data['event_custom'].ix[0:1000000] , train_cols.ix[0:1000000],missing='drop')
test_scores=m.fit(start_params=None,method='nm',maxiter=200,full_output=1)
Warning: Maximum number of iterations has been exceeded

Warning (from warnings module):
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 374
    warn(warndoc, Warning)
Warning: Inverting hessian failed, no bse or cov_params available


test_scores.summary()

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    test_scores.summary()
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2396, in summary
    yname_list)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/discrete/discrete_model.py", line 2253, in summary
    use_t=False)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 826, in add_table_params
    use_t=use_t)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/iolib/summary.py", line 447, in summary_params
    std_err = results.bse
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/tools/decorators.py", line 95, in __get__
    _cachedval = self.fget(obj)
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1037, in bse
    return np.sqrt(np.diag(self.cov_params()))
  File "/usr/local/lib/python2.7/site-packages/statsmodels-0.5.0-py2.7-linux-i686.egg/statsmodels/base/model.py", line 1102, in cov_params
    raise ValueError('need covariance of parameters for computing '
ValueError: need covariance of parameters for computing (unnormalized) covariances

(已通过以下@ user333700发布了建议)

Edit 2: (updated post the suggestions by @user333700 below)

重申我要建模的内容-不到总数的1% 用户转化"(成功的结果)-因此,我采取了均衡的 35(+ ve)/65(-ve)

Reiterating what I am trying to model - less than about 1% of total users "convert" (success outcomes) - so I took a balanced sample of 35(+ve) /65 (-ve)

我怀疑该模型虽然收敛,但并不健壮.因此,将使用"start_params"作为来自不同数据集的早期迭代的参数. 此编辑是关于确认是否可以将"start_params"输入到结果中,如下所示:

I suspect the model is not robust, though it converges. So, will use "start_params" as the params from earlier iteration, from a different dataset. This edit is about confirming is the "start_params" can feed into the results as below -:

A,independent_col_indexes=independent_columns(train_cols.as_matrix(columns=None))
A2=pd.DataFrame(A, columns=train_cols.columns[independent_col_indexes])
m  = smf.Logit(data['event_custom'], A2,missing='drop')
#m  = smf.Logit(data['event_custom'], train_cols,missing='drop')#,method='nm').fit()#This doesnt work, so tried 'nm' which work, but used lasso, as nm did not converge.
test_scores=m.fit_regularized(start_params=None, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)

a_good_looking_previous_result.params=test_scores.params #storing the parameters of pass1 to feed into pass2

test_scores.params
bidfloor_Quartile_modified_binned_0               0.305765
connectiontype_binned_0                          -0.436798
day_custom_binned_Fri                            -0.040269
day_custom_binned_Mon                             0.138599
day_custom_binned_Sat                            -0.319997
day_custom_binned_Sun                            -0.236507
day_custom_binned_Thu                            -0.058922
user_agent_device_family_binned_iPad            -10.793270
user_agent_device_family_binned_iPhone           -8.483099
user_agent_masterclass_binned_apple               9.038889
user_agent_masterclass_binned_generic            -0.760297
user_agent_masterclass_binned_samsung            -0.063522
log_height_width                                  0.593199
log_height_width_ScreenResolution                -0.520836
productivity                                     -1.495373
games                                             0.706340
entertainment                                    -1.806886
IAB24                                             2.531467
IAB17                                             0.650327
IAB14                                             0.414031
utilities                                         9.968253
IAB1                                              1.850786
social_networking                                -2.814148
IAB3                                             -9.230780
music                                             0.019584
IAB9                                             -0.415559
C(time_day_modified)[(6, 12]]:C(country)[AUS]    -0.103003
C(time_day_modified)[(0, 6]]:C(country)[HKG]      0.769272
C(time_day_modified)[(6, 12]]:C(country)[HKG]     0.406882
C(time_day_modified)[(0, 6]]:C(country)[IDN]      0.073306
C(time_day_modified)[(6, 12]]:C(country)[IDN]    -0.207568
C(time_day_modified)[(0, 6]]:C(country)[IND]      0.033370
... more params here 

现在在不同的数据集(pass2,用于索引)上,我的模型如下所示: IE.我读取了一个新的数据框,进行了所有变量转换,然后通过Logit进行了建模(如前所述).

Now on a different dataset(pass2, for indexing), I model the same as below -: ie. I read a new dataframe, do all the variable transformation and then model via Logit as earlier .

m_pass2  = smf.Logit(data['event_custom'], A2_pass2,missing='drop')
test_scores_pass2=m_pass2.fit_regularized(start_params=a_good_looking_previous_result.params, method='l1', maxiter='defined_by_method', full_output=1, disp=1, callback=None, alpha=0, \
trim_mode='auto', auto_trim_tol=0.01, size_trim_tol=0.0001, qc_tol=0.03)

,并可能通过从较早的遍历中获取"start_params"来保持迭代.

and, possibly keep iterating by picking up "start_params" from earlier passes.

推荐答案

以下几点:

您需要tol> 0才能检测接近完美的共线性,这也可能在以后的计算中引起数值问题. 检查A2的列数,以查看是否确实删除了某列.

You need tol > 0 to detect near perfect collinearity, which might also cause numerical problems in later calculations. Check the number of columns of A2 to see whether a column has really be dropped.

Logit需要使用exog进行一些非线性计算,因此,即使设计矩阵与完美共线性不是很接近,对数似然,导数或Hessian计算的转换变量仍可能最终采用数值形式问题,例如奇异的黑森州.

Logit needs to do some non-linear calculations with the exog, so even if the design matrix is not very close to perfect collinearity, the transformed variables for the log-likelihood, derivative or Hessian calculations might still end up being with numerical problems, like singular Hessian.

(当我们接近浮点精度1e-15、1e-16时,所有这些都是浮点问题.有时,matrix_rank和类似的linalg函数的默认阈值有所不同,这可能意味着在某些情况下,一个函数将其标识为单数,而另一个则不然.)

(All these are floating point problems when we work near floating point precision, 1e-15, 1e-16. There are sometimes differences in the default thresholds for matrix_rank and similar linalg functions which can imply that in some edge cases one function identifies it as singular and another one doesn't.)

包括Logit在内的离散模型的默认优化方法是一种简单的Newton方法,该方法在相当好的情况下速度很快,但在条件恶劣的情况下可能会失败.您可以尝试使用其他优化器之一,该优化器将是scipy中的优化器之一.optimize,method='nm'通常非常健壮,但运行缓慢,method='bfgs'在许多情况下都能很好地工作,但也会遇到收敛性问题.

The default optimization method for the discrete models including Logit is a simple Newton method, which is fast in reasonably nice cases, but can fail in cases that are badly conditioned. You could try one of the other optimizers which will be one of those in scipy.optimize, method='nm' is usually very robust but slow, method='bfgs' works well in many cases but also can run into convergence problems.

尽管如此,即使其他优化方法之一成功了,仍然有必要检查结果.通常,一种方法的失败意味着可能无法很好地定义模型或估计问题.

Nevertheless, even when one of the other optimization methods succeeds, it is still necessary to inspect the results. More often than not, a failure with one method means that the model or estimation problem might not be well defined.

检查是仅仅是起始值不正确还是规范问题的一个好方法是先运行method='nm',然后使用nm运行更准确的方法之一,例如newtonbfgs >将其估算为起始值,然后查看它是否从良好的起始值中获得成功.

A good way to check whether it is just a problem with bad starting values or a specification problem is to run method='nm' first and then run one of the more accurate methods like newton or bfgs using the nm estimate as starting value, and see whether it succeeds from good starting values.

这篇关于在Python中建模时检测mulicollinear或具有线性组合的列:LinAlgError的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆