Logistic regression Python solvers' definitions


Question

I am using the logistic regression function from sklearn, and was wondering what each of the solvers is actually doing behind the scenes to solve the optimization problem.

Can someone briefly describe what "newton-cg", "sag", "lbfgs" and "liblinear" are doing? If not, any related links or reading materials are much appreciated too.

Thanks a lot.

Answer

Well, I hope I'm not too late to the party! Let me first try to establish some intuition before digging into loads of information (warning: this is not a brief comparison).

A hypothesis h(x) takes an input and gives us the estimated output value.

This hypothesis can be as simple as a one-variable linear equation, or as complicated as a long multivariate equation, depending on the type of algorithm we are using (e.g. linear regression, logistic regression, etc.).

Our task is to find the best Parameters (a.k.a. Thetas or Weights) that give us the least error in predicting the output. We call this error a Cost or Loss Function, and our goal is obviously to minimize it in order to get the best predicted output!
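To make the cost function concrete for logistic regression, here is a minimal NumPy sketch (my own illustration, not part of the original answer) of the binary cross-entropy cost; the names theta, X and y are just placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Average negative log-likelihood (binary cross-entropy) for labels y in {0, 1}."""
    p = sigmoid(X @ theta)          # predicted probabilities h(x)
    eps = 1e-12                     # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

Every solver discussed below is just a different strategy for driving this kind of cost down.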

One more thing to recall: the relation between a parameter value and its effect on the cost function (i.e. the error) looks like a bell curve (i.e. it is Quadratic; remember this, because it is very important).

So if we start at any point on that curve and keep taking the derivative (i.e. the slope of the tangent line) at each point we stop at, we will end up at what is called the Global Optimum, as shown in this image:

If we take the partial derivative at the minimum-cost point (i.e. the global optimum), we find the slope of the tangent line = 0 (then we know that we have reached our target).
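As an aside (my own sketch, not from the original answer), this is exactly what plain gradient descent does: follow the negative derivative until the slope is roughly zero.

```python
def gradient_descent(grad, x0, lr=0.1, n_steps=100):
    """Repeatedly step against the derivative until (hopefully) the slope is ~0."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * grad(x)
    return x

# f(x) = (x - 3)^2 has its global optimum at x = 3, where f'(x) = 2 * (x - 3) = 0.
print(gradient_descent(lambda x: 2 * (x - 3), x0=10.0))  # ~3.0
```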

That is only valid if we have a Convex Cost Function; if we don't, we may end up stuck at what is called a Local Optimum. Consider this non-convex function:

Now you should have some intuition about the relationship between what we are doing and the terms: Derivative, Tangent Line, Cost Function, Hypothesis, etc.

Side note: the above intuition is also related to the Gradient Descent algorithm (see below).

Linear Approximation:

Given a function, f(x), we can find its tangent at x=a. The equation of the tangent line L(x) is: L(x) = f(a) + f′(a)(x−a).

Take a look at the following graph of a function and its tangent line:

From this graph we can see that near x=a, the tangent line and the function have nearly the same graph. On occasion we will use the tangent line, L(x), as an approximation to the function, f(x), near x=a. In these cases we call the tangent line the linear approximation to the function at x=a.
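In code, the linear approximation is just one line (a quick sketch of mine; np.exp is only used as an example function):

```python
import numpy as np

def linear_approx(f, df, a):
    """Return L(x) = f(a) + f'(a) * (x - a), the tangent line of f at x = a."""
    return lambda x: f(a) + df(a) * (x - a)

L = linear_approx(np.exp, np.exp, a=0.0)   # exp is its own derivative
print(np.exp(0.1), L(0.1))                 # very close near a, worse further away
```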

Quadratic Approximation:

It is similar to the linear approximation, but this time we are dealing with a curve, and we cannot find the point near 0 by using the tangent line alone.

Instead, we use a parabola (which is a curve where any point is at an equal distance from a fixed point or a fixed straight line), like this:

And in order to fit a good parabola, the parabola and the function should have the same value, the same first derivative, AND the same second derivative. The formula will be (just out of curiosity): Qa(x) = f(a) + f'(a)(x−a) + f''(a)(x−a)²/2
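As a quick sketch (mine, for illustration only), the quadratic approximation of cos at a=0:

```python
import numpy as np

def quadratic_approx(f, df, d2f, a):
    """Return Qa(x) = f(a) + f'(a)(x - a) + f''(a)(x - a)^2 / 2."""
    return lambda x: f(a) + df(a) * (x - a) + 0.5 * d2f(a) * (x - a) ** 2

Q = quadratic_approx(np.cos, lambda x: -np.sin(x), lambda x: -np.cos(x), a=0.0)
print(np.cos(0.5), Q(0.5))   # 0.8776 vs 0.8750: the parabola hugs the curve closely
```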

Now we should be ready to go through the detailed comparison.

1. Newton's Method

Recall the motivation for the gradient descent step at x: we minimize the quadratic function (i.e. the Cost Function).

Newton's method uses, in a sense, a better quadratic function minimization: better because it uses the quadratic approximation (i.e. first AND second partial derivatives).

You can imagine it as a twisted Gradient Descent that uses the Hessian (the Hessian is a square n×n matrix of second-order partial derivatives).

Moreover, the geometric interpretation of Newton's method is that at each iteration one approximates f(x) by a quadratic function around x_n, and then takes a step towards the maximum/minimum of that quadratic function (in higher dimensions, this may also be a saddle point). Note that if f(x) happens to be a quadratic function, then the exact extremum is found in one step.
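A one-dimensional sketch of the update (my own illustration; in higher dimensions the derivative and second derivative become the gradient and the Hessian):

```python
def newton_minimize(df, d2f, x0, n_steps=10):
    """1-D Newton step for minimization: x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(n_steps):
        x = x - df(x) / d2f(x)
    return x

# For the quadratic f(x) = (x - 3)^2, the exact minimum x = 3 is found in a single step.
print(newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0, n_steps=1))
```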

Downsides:

  1. It's computationally expensive because of the Hessian matrix (i.e. the second partial derivative calculations).

  2. It is attracted to Saddle Points, which are common in multivariable optimization (i.e. points whose partial derivatives disagree over whether this input should be a maximum or a minimum point!).

2. Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS):

In a nutshell, it is an analogue of Newton's method, but here the Hessian matrix is approximated using updates specified by gradient evaluations (or approximate gradient evaluations). In other words, it uses an estimate of the inverse Hessian matrix.

The term Limited-memory simply means it stores only a few vectors that represent the approximation implicitly.

If I dare say, when the dataset is small, L-BFGS performs the best relative to the other methods, especially since it saves a lot of memory. However, there are some "serious" drawbacks, such that if it is unsafeguarded, it may not converge to anything.
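To get a feel for how L-BFGS is typically driven, here is a sketch (my own, not scikit-learn's internal code) that fits a plain, unregularized logistic model with SciPy's L-BFGS-B routine, supplying only the cost and gradient, never a Hessian; the data and variable names are made up:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Binary cross-entropy and its gradient; L-BFGS needs only these two."""
    p = sigmoid(X @ theta)
    cost = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return cost, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # synthetic labels

res = minimize(cost_and_grad, x0=np.zeros(3), args=(X, y),
               jac=True, method="L-BFGS-B")
print(res.x)   # fitted weights
```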

3. A Library for Large Linear Classification (liblinear):

It is a linear classification library that supports logistic regression and linear support vector machines (a linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics, i.e. the feature values).

The solver uses a coordinate descent (CD) algorithm that solves optimization problems by successively performing approximate minimization along coordinate directions or coordinate hyperplanes.
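As a toy illustration of the coordinate-wise idea (not liblinear's actual dual CD algorithm), here is coordinate descent on a plain least-squares objective, minimizing exactly over one coordinate at a time:

```python
import numpy as np

def coordinate_descent_lstsq(X, y, n_sweeps=100):
    """Minimize ||X w - y||^2 one coordinate at a time (exact 1-D minimization per step)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            residual = y - X @ w + X[:, j] * w[j]            # residual with coordinate j removed
            w[j] = X[:, j] @ residual / (X[:, j] @ X[:, j])  # best value for w[j] alone
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, -1.0, 0.0, 3.0])
print(coordinate_descent_lstsq(X, y))   # ~[2, -1, 0, 3]
```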

LIBLINEAR is the winner of the ICML 2008 large-scale learning challenge. It applies automatic parameter selection (a.k.a. L1 regularization) and is recommended when you have a high-dimensional dataset (it is recommended for solving large-scale classification problems).

Downsides:

  1. It may get stuck at a non-stationary point (i.e. a non-optimum) if the level curves of a function are not smooth.

  2. It also cannot run in parallel.

  3. It cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a "one-vs-rest" fashion, so separate binary classifiers are trained for all classes.

Side note: according to the Scikit documentation, the "liblinear" solver was the one used by default for historical reasons.

4. Stochastic Average Gradient (SAG):

The SAG method optimizes the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods.

It is faster than other solvers for large datasets, when both the number of samples and the number of features are large.
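To illustrate the "memory of previous gradient values" idea, here is a toy SAG sketch on a simple least-squares sum (my own illustration, not scikit-learn's implementation; the step size follows the commonly cited conservative 1/(16·L) heuristic):

```python
import numpy as np

def sag_least_squares(X, y, n_iter=20000, seed=0):
    """Toy SAG on f(w) = (1/n) * sum_i 0.5 * (x_i·w - y_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lr = 1.0 / (16 * np.max(np.sum(X ** 2, axis=1)))   # conservative SAG step size
    w = np.zeros(d)
    grad_memory = np.zeros((n, d))   # last gradient seen for each sample
    grad_sum = np.zeros(d)           # running sum of the stored gradients
    for _ in range(n_iter):
        i = rng.integers(n)
        g_new = X[i] * (X[i] @ w - y[i])     # fresh gradient for sample i only
        grad_sum += g_new - grad_memory[i]   # keep the stored sum up to date
        grad_memory[i] = g_new
        w -= lr * grad_sum / n               # step with the *average* stored gradient
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])           # noiseless targets
print(sag_least_squares(X, y))               # approximately [1.0, -2.0, 0.5]
```

The grad_memory table is exactly why SAG's memory cost grows with the number of samples, as noted below.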

Downsides:

  1. It only supports L2 penalization.

  2. Its memory cost is O(N), which can make it impractical for large N (because it remembers the most recently computed values for approximately all gradients).

5. SAGA:

The SAGA solver is a variant of SAG that also supports the non-smooth penalty='l1' option (i.e. L1 regularization). This is therefore the solver of choice for sparse multinomial logistic regression, and it is also suitable for very large datasets.
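A minimal usage sketch (mine; it assumes a recent scikit-learn, and the dataset and hyperparameters are arbitrary) of "saga" with an L1 penalty on a multiclass problem — note that sag/saga converge fastest on features with roughly the same scale, hence the scaler:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)   # a 10-class problem

# saga + L1 gives a sparse multinomial model; scaling helps sag/saga converge.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", penalty="l1", C=0.1, max_iter=5000),
)
clf.fit(X, y)
print(clf.score(X, y))
```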

Side note: according to the Scikit documentation, the "saga" solver is often the best choice.

The following table is taken from the Scikit documentation:
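Finally, as a wrap-up, here is a small sketch (my own, not part of the quoted documentation) that simply tries each of the solvers discussed above on the same synthetic problem; the dataset and max_iter value are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A small synthetic problem, just to exercise the solver option.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for solver in ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]:
    clf = LogisticRegression(solver=solver, max_iter=1000)
    clf.fit(X, y)
    print(solver, clf.score(X, y))
```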
