Logistic regression python solvers' definitions


Question


I am using the logistic regression function from sklearn, and was wondering what each of the solvers is actually doing behind the scenes to solve the optimization problem.

Can someone briefly describe what "newton-cg", "sag", "lbfgs" and "liblinear" are doing?

Solution

Well, I hope I'm not too late to the party! Let me first try to establish some intuition before digging into loads of information (warning: this is not a brief comparison).


Introduction

A hypothesis h(x) takes an input and gives us the estimated output value.

This hypothesis can be as simple as a one-variable linear equation, or a very complicated and long multivariate equation, depending on the type of algorithm we're using (i.e. linear regression, logistic regression, etc.).

Our task is to find the best Parameters (a.k.a Thetas or Weights) that give us the least error in predicting the output. We call this error a Cost or Loss Function and apparently our goal is to minimize it in order to get the best predicted output!

One more thing to recall is that the relation between a parameter value and its effect on the cost function (i.e. the error) looks like a bowl-shaped curve (i.e. quadratic; recall this because it's very important).

So if we start at any point on that curve and keep following the derivative (i.e. the slope of the tangent line) at each point we stop at, we will end up at what is called the Global Optimum, as shown in this image:

If we take the partial derivative at the minimum cost point (i.e. the global optimum), we find that the slope of the tangent line is 0 (and then we know we have reached our target).

That's valid only if we have a convex cost function; if we don't, we may end up stuck at what is called a Local Optimum. Consider this non-convex function:

Now you should have an intuition about the relationship between what we are doing and the terms: derivative, tangent line, cost function, hypothesis, etc.

Side Note: The above-mentioned intuition is also related to the Gradient Descent algorithm (see below).
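
To make that intuition concrete, here is a minimal sketch of gradient descent on a one-dimensional quadratic cost. The cost J(theta) = (theta − 3)², the starting point and the learning rate are illustrative assumptions, not anything taken from scikit-learn:

```python
# Minimal gradient descent sketch on the 1-D quadratic cost J(theta) = (theta - 3)**2.
# The cost, the starting point and the learning rate are illustrative assumptions.

def cost_gradient(theta):
    return 2.0 * (theta - 3.0)        # derivative of (theta - 3)**2

theta = 10.0                          # arbitrary starting point on the curve
learning_rate = 0.1

for _ in range(1000):
    grad = cost_gradient(theta)
    if abs(grad) < 1e-8:              # slope of the tangent line ~ 0 -> optimum reached
        break
    theta -= learning_rate * grad     # step against the derivative

print(theta)                          # approaches 3.0, the minimum of the cost
```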


Background

Linear Approximation:

Given a function, f(x), we can find its tangent at x=a. The equation of the tangent line L(x) is: L(x)=f(a)+f′(a)(x−a).

Take a look at the following graph of a function and its tangent line:

From this graph we can see that near x=a, the tangent line and the function have nearly the same graph. On occasion we will use the tangent line, L(x), as an approximation to the function, f(x), near x=a. In these cases we call the tangent line the linear approximation to the function at x=a.

Quadratic Approximation:

Same idea as the linear approximation, but this time we are dealing with a curve, and we cannot approximate the points near the point of interest well enough by using the tangent line alone.

Instead, we use a parabola (a curve where any point is at an equal distance from a fixed point and a fixed straight line), like this:

And in order to fit a good parabola, the parabola and the function should have the same value, the same first derivative AND the same second derivative at x=a. The formula will be (just out of curiosity): Qa(x) = f(a) + f'(a)(x−a) + f''(a)(x−a)²/2
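
As a quick numerical sketch of both approximations above (the function f(x) = eˣ and the expansion point a = 0 are just example choices, not from the original answer):

```python
import math

# Example function and expansion point (illustrative assumptions): f(x) = e**x at a = 0
f, df, d2f = math.exp, math.exp, math.exp   # f, f' and f'' are all e**x
a = 0.0

def linear_approx(x):
    # L(x) = f(a) + f'(a)(x - a)
    return f(a) + df(a) * (x - a)

def quadratic_approx(x):
    # Qa(x) = f(a) + f'(a)(x - a) + f''(a)(x - a)**2 / 2
    return f(a) + df(a) * (x - a) + d2f(a) * (x - a) ** 2 / 2

x = 0.5
print(f(x), linear_approx(x), quadratic_approx(x))
# exp(0.5) ~ 1.649; the quadratic approximation (1.625) is closer than the linear one (1.5)
```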

Now we should be ready to do the comparison in detail.


Comparison between the methods

1. Newton’s Method

Recall the motivation for the gradient descent step at x: we minimize the quadratic function (i.e. the cost function).

Newton's method uses, in a sense, a better quadratic function minimisation: better because it uses the quadratic approximation (i.e. first AND second partial derivatives).

You can imagine it as a twisted gradient descent that uses the Hessian (the Hessian is an n×n square matrix of second-order partial derivatives).

Moreover, the geometric interpretation of Newton's method is that at each iteration one approximates f(x) by a quadratic function around xn, and then takes a step towards the maximum/minimum of that quadratic function (in higher dimensions, this may also be a saddle point). Note that if f(x) happens to be a quadratic function, then the exact extremum is found in one step.
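
A one-dimensional sketch of that idea, where the Newton update is theta ← theta − f'(theta)/f''(theta) (the function below is an illustrative assumption; in n dimensions the division by f'' becomes multiplication by the inverse Hessian):

```python
# Newton's method sketch for minimising f(theta) = theta**4 - 3*theta**3 + 2
# (an illustrative choice). Each step uses BOTH the first and the second derivative.

def d1(theta):                        # first derivative
    return 4 * theta**3 - 9 * theta**2

def d2(theta):                        # second derivative
    return 12 * theta**2 - 18 * theta

theta = 6.0                           # starting guess
for _ in range(50):
    step = d1(theta) / d2(theta)      # in n dimensions: inverse Hessian times gradient
    theta -= step
    if abs(step) < 1e-10:
        break

print(theta)                          # converges to 2.25, where f'(theta) = 0
```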

Drawbacks:

  1. It’s computationally expensive because of The Hessian Matrix (i.e. second partial derivatives calculations).

  2. It is attracted to Saddle Points, which are common in multivariable optimization (i.e. points where the partial derivatives disagree over whether the input should be a maximum or a minimum!).

2. Limited-memory Broyden–Fletcher–Goldfarb–Shanno Algorithm:

In a nutshell, it is an analogue of Newton's method, but here the Hessian matrix is approximated using updates specified by gradient evaluations (or approximate gradient evaluations). In other words, it uses an estimate of the inverse Hessian matrix.

The term Limited-memory simply means it stores only a few vectors that represent the approximation implicitly.

If I may dare say, when the dataset is small, L-BFGS performs the best compared to the other methods, especially because it saves a lot of memory. However, there are some "serious" drawbacks: if it is unsafeguarded, it may not converge to anything.

Side note: This solver became the default solver in sklearn's LogisticRegression in version 0.22, replacing LIBLINEAR.
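
A hedged usage sketch in scikit-learn (the synthetic dataset below is purely illustrative; naming the solver explicitly is only for clarity, since lbfgs is already the default):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data purely for illustration
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# lbfgs is the default solver since 0.22; it supports the l2 penalty (or none)
clf = LogisticRegression(solver="lbfgs", penalty="l2", max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```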

3. A Library for Large Linear Classification:

It's a linear classification library that supports logistic regression and linear support vector machines (a linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics, i.e. the feature values).

The solver uses a coordinate descent (CD) algorithm that solves optimization problems by successively performing approximate minimization along coordinate directions or coordinate hyperplanes.
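
A tiny sketch of the coordinate descent idea itself, using a simple two-variable quadratic as an illustrative stand-in for LIBLINEAR's actual objective:

```python
# Coordinate descent sketch: minimise f(x, y) = (x - 1)**2 + (y + 2)**2 + 0.5*x*y
# by exactly minimising along one coordinate at a time while holding the other fixed.
# The objective is an illustrative assumption, not LIBLINEAR's real optimization problem.

x, y = 0.0, 0.0
for _ in range(100):
    x = 1.0 - 0.25 * y        # solve df/dx = 2(x - 1) + 0.5*y = 0 for x
    y = -2.0 - 0.25 * x       # solve df/dy = 2(y + 2) + 0.5*x = 0 for y

print(x, y)                   # converges to the joint minimiser (1.6, -2.4)
```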

LIBLINEAR was the winner of the ICML 2008 large-scale learning challenge. It applies automatic parameter selection (a.k.a. L1 regularization) and is recommended when you have a high-dimensional dataset (recommended for solving large-scale classification problems).

Drawbacks:

  1. It may get stuck at a non-stationary point (i.e. a non-optimum) if the level curves of the function are not smooth.

  2. It also cannot run in parallel.

  3. It cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a "one-vs-rest" fashion so separate binary classifiers are trained for all classes.

Side note: According to the Scikit documentation: The "liblinear" solver was the one used by default for historical reasons before version 0.22. Since then, the default is the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (lbfgs).
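
A hedged scikit-learn sketch with this solver (synthetic binary data; the l1 penalty is shown because liblinear is one of the solvers that accepts it):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data purely for illustration
X, y = make_classification(n_samples=1_000, n_features=50, random_state=0)

# liblinear handles binary problems natively and accepts both 'l1' and 'l2' penalties;
# multiclass targets are handled through a one-vs-rest decomposition
clf = LogisticRegression(solver="liblinear", penalty="l1", C=1.0)
clf.fit(X, y)
print(clf.score(X, y))
```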

4. Stochastic Average Gradient:

The SAG method optimizes the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods.

It is faster than other solvers for large datasets, when both the number of samples and the number of features are large.
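
A hedged sketch (synthetic data; the StandardScaler step is my own addition, since sag's fast convergence is only guaranteed when the features are on roughly the same scale):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Larger synthetic dataset purely for illustration
X, y = make_classification(n_samples=50_000, n_features=100, random_state=0)

# sag supports only the l2 penalty (or none) and converges fastest on scaled features
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(solver="sag", penalty="l2", max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))
```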

Drawbacks:

  1. It only supports L2 penalization.

  2. It has a memory cost of O(N), which can make it impractical for large N (because it remembers the most recently computed values for approximately all gradients).

5. SAGA:

The SAGA solver is a variant of SAG that also supports the non-smooth penalty='l1' option (i.e. L1 regularization). This is therefore the solver of choice for sparse multinomial logistic regression, and it's also suitable for very large datasets.

Side note: According to Scikit Documentation: The SAGA solver is often the best choice.
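
A hedged sketch of the sparse multinomial use case mentioned above (synthetic multiclass data; the scaling step is my own addition, since saga, like sag, benefits from features on a common scale):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multiclass data purely for illustration
X, y = make_classification(n_samples=20_000, n_features=200, n_informative=20,
                           n_classes=4, random_state=0)

# saga supports the non-smooth l1 penalty (and elasticnet) together with the
# multinomial loss, which makes sparse multiclass models possible
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", penalty="l1", C=0.5, max_iter=2000),
)
clf.fit(X, y)
print(clf.score(X, y))
```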


Summary

The following table is taken from Scikit Documentation
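
The table itself was an image in the source and does not survive here. As a rough stand-in, the following hedged sketch loops over the solvers discussed above on the same synthetic data, which is one way to compare them on a problem of your own:

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data purely for illustration
X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

for solver in ("newton-cg", "lbfgs", "liblinear", "sag", "saga"):
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(solver=solver, max_iter=2000))
    start = perf_counter()
    clf.fit(X, y)
    print(f"{solver:>10}: accuracy={clf.score(X, y):.3f}, "
          f"fit time={perf_counter() - start:.2f}s")
```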
