ConvergenceWarning: Liblinear failed to converge, increase the number of iterations


Problem Description

Running the code of linear binary pattern for Adrian. This program runs but gives the following warning:

C:\Python27\lib\site-packages\sklearn\svm\base.py:922: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning

I am running python2.7 with opencv3.7, what should I do?

Recommended Answer

When an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to poor scaling of the decision variables. There are a few things you can try.

  1. Normalize your training data so that the problem hopefully becomes better conditioned, which in turn can speed up convergence. One possibility is to scale your data to zero mean and unit standard deviation using, for example, Scikit-Learn's StandardScaler. Note that you have to apply the StandardScaler fitted on the training data to the test data. Also, if you have discrete features, make sure they are transformed properly so that scaling them makes sense (see the sketch after this list).
  2. Related to 1), make sure the other arguments, such as the regularization weight C, are set appropriately. C has to be > 0. Typically one tries various values of C on a logarithmic scale (1e-5, 1e-4, 1e-3, ..., 1, 10, 100, ...) before fine-tuning it at a finer granularity within a particular interval. These days, it probably makes more sense to tune parameters with Bayesian optimization, using a package such as Scikit-Optimize.
  3. Set max_iter to a larger value. The default is 1000. This should be your last resort. If the optimization process does not converge within the first 1000 iterations, having it converge by setting a larger max_iter typically masks other problems such as those described in 1) and 2). It might even indicate that you have some inappropriate features or strong correlations among the features. Debug those first before taking this easy way out.
  4. Set dual = True if the number of features > number of examples, and vice versa. This solves the SVM optimization problem using the dual formulation. Thanks @Nino van Hooff for pointing this out, and @JamesKo for spotting my mistake.
  5. Use a different solver, e.g., the L-BFGS solver if you are using Logistic Regression. See @5ervant's answer.
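
To make points 1), 2) and 4) concrete, here is a minimal sketch using scikit-learn's Pipeline, StandardScaler, LinearSVC and GridSearchCV. The arrays X_train and y_train are placeholders for your own data, and the grid of C values is only illustrative:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

X_train = np.random.randn(200, 50)           # placeholder: 200 examples, 50 features
y_train = np.random.randint(0, 2, size=200)  # placeholder binary labels

# Point 4): dual=True is preferred when n_features > n_samples; here
# n_samples > n_features, so we keep the primal formulation (dual=False).
pipe = Pipeline([
    ("scale", StandardScaler()),             # point 1): standardize the features
    ("svm", LinearSVC(dual=False, max_iter=1000)),
])

# Point 2): search C on a logarithmic grid before fine-tuning it.
param_grid = {"svm__C": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```

Because the StandardScaler sits inside the Pipeline, it is refitted on each cross-validation fold and applied consistently to the held-out data, which is exactly the train/test discipline point 1) asks for.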

Note: You should not ignore this warning.

This warning appears because:

  1. Solving the linear SVM is just solving a quadratic optimization problem. The solver is typically an iterative algorithm that keeps a running estimate of the solution (i.e., the weight and bias for the SVM). It stops running when the solution corresponds to an objective value that is optimal for this convex optimization problem, or when it hits the maximum number of iterations set.

If the algorithm does not converge, then the current estimate of the SVM's parameters is not guaranteed to be any good, and hence the predictions can also be complete garbage.
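
If you want to surface the warning explicitly rather than let it scroll by, one option (a sketch, assuming scikit-learn's sklearn.exceptions.ConvergenceWarning and placeholder data) is to promote it to an error during fitting and handle it:

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

X = np.random.randn(100, 20)           # placeholder data
y = np.random.randint(0, 2, size=100)

with warnings.catch_warnings():
    # Treat the convergence warning as an error so it cannot be missed.
    warnings.simplefilter("error", ConvergenceWarning)
    try:
        # max_iter is kept deliberately small here so the warning is likely to fire.
        clf = LinearSVC(max_iter=10).fit(X, y)
    except ConvergenceWarning:
        # The running estimate of the weights is not trustworthy at this point;
        # rescale the data / tune C before simply raising max_iter.
        print("liblinear did not converge within max_iter iterations")
```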

Edit

In addition, consider the comment by @Nino van Hooff and @5ervant to use the dual formulation of the SVM. This is especially important if the number of features you have, D, is more than the number of training examples N. This is what the dual formulation of the SVM is particularly designed for, and it helps with the conditioning of the optimization problem. Credit to @5ervant for noticing and pointing this out.

Furthermore, @5ervant also pointed out the possibility of changing the solver, in particular the use of the L-BFGS solver. Credit to him (i.e., upvote his answer, not mine).

I would like to provide a quick, rough explanation for those who are interested (I am :)) of why this matters in this case. Second-order methods, and in particular approximate second-order methods like the L-BFGS solver, help with ill-conditioned problems because they approximate the Hessian at each iteration and use it to scale the gradient direction. This allows them to achieve a better convergence rate, but possibly at a higher compute cost per iteration. That is, it takes fewer iterations to finish, but each iteration will be slower than in a typical first-order method like gradient descent or its variants.

For example, a typical first-order method might update the solution at each iteration like

x(k + 1) = x(k) - alpha(k) * gradient(f(x(k)))

where alpha(k), the step size at iteration k, depends on the particular choice of algorithm or learning rate schedule.

A second-order method, e.g., Newton's method, will have an update equation like

x(k + 1) = x(k) - alpha(k) * Hessian(x(k))^(-1) * gradient(f(x(k)))

That is, it uses the information of the local curvature encoded in the Hessian to scale the gradient accordingly. If the problem is ill-conditioned, the gradient will be pointing in less than ideal directions and the inverse Hessian scaling will help correct this.
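
As an illustrative toy example (not part of the original answer), the following sketch minimizes an ill-conditioned quadratic f(x) = 0.5 * x^T H x and contrasts the plain gradient step above with the Newton step that rescales the gradient by the inverse Hessian:

```python
import numpy as np

H = np.diag([1.0, 100.0])        # condition number 100: an ill-conditioned problem
x_gd = np.array([1.0, 1.0])      # gradient-descent iterate
x_newton = np.array([1.0, 1.0])  # Newton iterate
alpha = 0.009                    # step size must stay below 2/100 for gradient descent

for k in range(50):
    grad_gd = H.dot(x_gd)
    x_gd = x_gd - alpha * grad_gd                        # x(k+1) = x(k) - alpha * gradient
    grad_n = H.dot(x_newton)
    x_newton = x_newton - np.linalg.inv(H).dot(grad_n)   # x(k+1) = x(k) - H^(-1) * gradient

print("gradient descent after 50 steps:", x_gd)      # still far from 0 in the flat direction
print("Newton after 50 steps:", x_newton)            # hits the optimum (0, 0) on the first step
```

The gradient step must use a tiny learning rate to stay stable along the steep direction, so it crawls along the flat direction; the Hessian rescaling removes that imbalance.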

In particular, L-BFGS, mentioned in @5ervant's answer, is a way to approximate the inverse of the Hessian, since computing it exactly can be an expensive operation.
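
As a minimal sketch of that suggestion (again with placeholder data), switching to the L-BFGS solver in scikit-learn's LogisticRegression looks like this:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.randn(150, 30)           # placeholder data
y = np.random.randint(0, 2, size=150)

# 'lbfgs' builds a low-memory approximation of the inverse Hessian from recent
# gradients instead of forming it explicitly, keeping each iteration affordable.
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))
```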

However, second-order methods might converge much faster (i.e., require fewer iterations) than first-order methods like the usual gradient-descent based solvers, which, as you know by now, sometimes fail to even converge. This can compensate for the time spent at each iteration.

In summary, if you have a well-conditioned problem, or if you can make it well-conditioned through other means such as using regularization and/or feature scaling and/or making sure you have more examples than features, you probably don't have to use a second-order method. But these days, with many models optimizing non-convex problems (e.g., those in DL models), second-order methods such as L-BFGS play a different role there, and there is evidence to suggest they can sometimes find better solutions compared to first-order methods. But that is another story.
