Why does the single-layer perceptron converge so slowly without normalization, even when the margin is large?

Question

This question was completely rewritten after I confirmed my results (the Python notebook can be found here) with a piece of code written by someone else (it can be found here). Here is that code, instrumented by me to work with my data and to count epochs until convergence:

import numpy as np
from matplotlib import pyplot as plt

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=0.1, epochs=1000000):
        self.W = np.zeros(input_size+1)  # add one weight for the bias
        #self.W = np.random.randn(input_size+1)
        self.epochs = epochs
        self.lr = lr

    def predict(self, x):
        z = self.W.T.dot(x)
        return 1 if z >= 0 else 0  # step activation

    def fit(self, X, d):
        errors = []
        for epoch in range(self.epochs):
            if (epoch + 1) % 10000 == 0: print('Epoch',epoch + 1)
            total_error = 0
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)          # prepend the bias input
                y = self.predict(x)
                e = d[i] - y                       # error is -1, 0 or 1
                total_error += np.abs(e)
                self.W = self.W + self.lr * e * x  # online weight update
                #print('W: ', self.W)
            errors += [total_error]
            if (total_error == 0):
                print('Done after', epoch, 'epochs')
                nPlot = 100
                plt.plot(list(range(len(errors)-nPlot, len(errors))), errors[-nPlot:])
                plt.show()
                break

if __name__ == '__main__':
    trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                            [306.23240054, 128.3794866 ,   1.        ],
                            [216.67811217, 148.58167262,   1.        ],
                            [223.64431813, 197.75745016,   1.        ],
                            [486.68209275,  96.09115377,   1.        ],
                            [400.71323154, 125.18183395,   1.        ],
                            [288.87299305, 204.52217766,   1.        ],
                            [245.1492875 ,  55.75847006,  -1.        ],
                            [ 14.95991122, 185.92681911,   1.        ],
                            [393.92908798, 193.40527965,   1.        ],
                            [494.15988362, 179.23456285,   1.        ],
                            [235.59039363, 175.50868526,   1.        ],
                            [423.72071607,   9.50166894,  -1.        ],
                            [ 76.52735621, 208.33663341,   1.        ],
                            [495.1492875 ,  -7.73818431,  -1.        ]])
    X = trainingSet[:, :2]
    d = trainingSet[:, -1]
    d = np.where(d == -1, 1, 0)  # relabel the classes: -1 -> 1, +1 -> 0
    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)

The training set consists of 15 points, with a large separation margin. The Perceptron algorithm finds a separator as shown below, but after as many as 122,346 epochs:

As the Wikipedia article explains, the number of epochs the perceptron needs to converge is proportional to the square of the size of the vectors and inversely proportional to the square of the margin. In my data the size of the vectors is large, but so is the margin.
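To put rough numbers on that bound, the radius R of the data and the geometric margin gamma can be estimated directly. A minimal sketch, assuming trainingSet is the 15-point array from the code above; using a hard-margin linear SVM from sklearn to estimate the margin is my own choice, and the exact constant in the bound depends on how the bias term is handled, so this is an order-of-magnitude check only:

import numpy as np
from sklearn.svm import SVC

X = trainingSet[:, :2]
y = trainingSet[:, -1]                        # labels are +/-1 here

R = np.max(np.linalg.norm(X, axis=1))         # radius of the data

svm = SVC(kernel='linear', C=1e6).fit(X, y)   # approximately hard-margin
w, b = svm.coef_[0], svm.intercept_[0]
gamma = np.min(y * (X @ w + b)) / np.linalg.norm(w)  # geometric margin

print('R =', R, 'gamma =', gamma, '(R/gamma)^2 =', (R / gamma) ** 2)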

I seek to understand why so many epochs are required.

Update: As per the request in the comments, I updated the code to plot the total errors of the last 100 epochs. Here is the plot:

P.S.: After scaling the features to be distributed as N(0,1), the algorithm converges after two epochs. However, I do not grasp why the algorithm would not converge in a reasonable amount of time even without such scaling.
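For reference, the scaling step can be reproduced along these lines; the post does not show the exact code, so sklearn's StandardScaler is an assumption:

# Hypothetical reproduction of the scaling step; StandardScaler is an
# assumption, since the exact code is not shown in the question.
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(X)   # each feature -> mean 0, std 1
perceptron = Perceptron(input_size=2)
perceptron.fit(X, d)                    # converges after ~2 epochs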

Solution

I kind of regretted it when I could not answer your question properly one month ago; now I give it another try. I leave the older answer for the record.

I think the problem relates to the convexity and local minima of the loss function, which make it difficult to converge. However, with your problem as you set it up, I'm not really sure about the derivative of your loss function, so I modified the activation function to a sigmoid so that I can easily apply the log loss.

This is the new predict:

def predict(self, x):
    z = self.W.T.dot(x)
    return 1/(1+np.exp(-z))  # sigmoid activation instead of a hard step

And this is the loop over the training data, which also computes the loss:

# This block runs inside fit(), once per epoch: loss, dw and total_error
# restart at 0 each epoch, while errors and losses collect one value per epoch.
loss = 0
dw = 0
total_error = 0
for i in range(d.shape[0]):
    x = np.insert(X[i], 0, 1)
    y = self.predict(x)
    e = d[i] - (1 if y > 0.5 else 0)    # error of the thresholded prediction
    total_error += np.abs(e)
    dw += self.lr * e * x               # accumulate a single batch update
    loss2add = (-1) * (np.log(y) if d[i] else np.log(1-y))  # log loss
    if np.isinf(loss2add) or np.isnan(loss2add):
        loss += 500                     # cap the loss when the sigmoid saturates
    else:
        loss += loss2add
self.W = self.W + dw                    # one weight update per epoch
errors += [total_error]
losses += [loss/d.shape[0]]

It converges in 103K epochs, so I hope you believe this behaves similarly to your initial setup.

Then I plot the cost function as a function of W. To keep it simple, I take two components from a known solution and vary only the remaining one. This is the code (it could be cleaner, I know):

import numpy as np
from matplotlib import pyplot as plt

def predict(W, x):
    z = W.dot(x)
    return 1/(1+np.exp(-z))

trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                        [306.23240054, 128.3794866 ,   1.        ],
                        [216.67811217, 148.58167262,   1.        ],
                        [223.64431813, 197.75745016,   1.        ],
                        [486.68209275,  96.09115377,   1.        ],
                        [400.71323154, 125.18183395,   1.        ],
                        [288.87299305, 204.52217766,   1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [ 14.95991122, 185.92681911,   1.        ],
                        [393.92908798, 193.40527965,   1.        ],
                        [494.15988362, 179.23456285,   1.        ],
                        [235.59039363, 175.50868526,   1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [ 76.52735621, 208.33663341,   1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ]])
X = trainingSet[:, :2]
d = trainingSet[:, -1]
d = np.where(d == -1, 1, 0)
losses = []
ws = []
n_points = 10001
for w1 in np.linspace(-40, 40, n_points):
    ws += [w1]
    W = np.array([3629., w1, -238.21109877])
    loss = 0
    for i in range(d.shape[0]):
        x = np.insert(X[i], 0, 1)
        y = predict(W, x)
        loss2add = (-1) * (np.log(y) if d[i] else np.log(1-y))
        if np.isinf(loss2add) or np.isnan(loss2add):
            loss += 500
        else:
            loss += loss2add
    losses += [loss]
plt.plot(ws, losses)
plt.show()

The solution for w1 is 39.48202635. Take a look at the resulting loss curve: it has some peaks, and therefore some local minima in which the optimization can easily get stuck.
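One way to make the "peaks" observation concrete is to count the local minima of the scanned curve directly. A small sketch, assuming ws and losses are the lists produced by the scan above:

import numpy as np

losses = np.asarray(losses)              # from the w1 scan above
dl = np.diff(losses)
# a local minimum is where the slope flips from negative to positive
idx = np.where((dl[:-1] < 0) & (dl[1:] > 0))[0] + 1
print(len(idx), 'local minima, at w1 =', np.asarray(ws)[idx])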

However, if you center the data with

from sklearn.preprocessing import scale
X = scale(X, with_mean=True, with_std=False)  # subtract the mean, keep the scale

and set the w's to

W = np.array([-550.3, w1, -59.65467824])

you get the following loss function, which has its minimum in the expected area (the solution for w1 is -11.00208344).

I'd expect a smoother function for the balanced dataset.

Hope it is clearer now!


EDIT after comments

This is the loss function when standardizing; it converges in 26 epochs.

(Not centering in this case!)

The solution for w1 is about 0.7, and the loss is even smoother. It makes sense that standardizing works so well with logistic regression, since it keeps the output of the activation function from saturating.
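A tiny numeric illustration of that saturation effect (my own addition, not part of the original answer): with the raw features, the pre-activation z = W.dot(x) can be in the hundreds, where the sigmoid's gradient is numerically zero, so the loss barely responds to changes in the weights:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# the sigmoid's gradient is s * (1 - s); it vanishes for large |z|
for z in (0.7, 5.0, 50.0, 500.0):
    s = sigmoid(z)
    print('z = %6.1f  sigmoid = %.8f  gradient = %.3e' % (z, s, s * (1 - s)))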

For the rest, I don't have anything to add on how to reconcile these results with the theory you mention. I guess the theorem fixes an upper bound, but beyond that I can't say. Cheers.
