Softmax derivative in NumPy approaches 0 (implementation)


Problem description


I'm trying to implement the softmax function for a neural network written in Numpy. Let h be the softmax value of a given signal i.

I've struggled to implement the softmax activation function's partial derivative.

I'm currently stuck at an issue where all the partial derivatives approach 0 as the training progresses. I've cross-referenced my math with this excellent answer, but my math does not seem to work out.
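For reference, the partial derivatives being implemented here are the standard softmax ones (my restatement; h_i is the softmax output and z_j the corresponding pre-softmax input):

$$\frac{\partial h_i}{\partial z_j} = h_i(\delta_{ij} - h_j) = \begin{cases} h_i(1 - h_i) & \text{if } i = j \\ -h_i\, h_j & \text{if } i \neq j \end{cases}$$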

import numpy as np
def softmax_function( signal, derivative=False ):
    # Calculate activation signal
    e_x = np.exp( signal )
    signal = e_x / np.sum( e_x, axis = 1, keepdims = True )

    if derivative:
        # Return the partial derivation of the activation function
        return np.multiply( signal, 1 - signal ) + sum(
            # handle the off-diagonal values
            - signal * np.roll( signal, i, axis = 1 )
            for i in xrange(1, signal.shape[1] )
        )
    else:
        # Return the activation signal
        return signal
#end activation function

The signal parameter contains the input signal sent into the activation function and has the shape (n_samples, n_features).

# sample signal (3 samples, 3 features)
signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]

The following code snippet is a fully working activation function and is only included as a reference and proof (mostly for myself) that the conceptual idea actually works.

from scipy.special import expit
import numpy as np
def sigmoid_function( signal, derivative=False ):
    # Prevent overflow.
    signal = np.clip( signal, -500, 500 )

    # Calculate activation signal
    signal = expit( signal )

    if derivative:
        # Return the partial derivation of the activation function
        return np.multiply(signal, 1 - signal)
    else:
        # Return the activation signal
        return signal
#end activation function

Edit

  • The problem persists even with simple single-layer networks. The softmax (and its derivative) is applied at the final layer.

Solution

This is an answer on how to calculate the derivative of the softmax function in a more vectorized numpy fashion. However, the fact that the partial derivatives approach zero might not be a math issue; it may just be a problem of the learning rate or the known dying-weight issue with complex deep neural networks. Layers like ReLU help prevent the latter issue.


First, I've used the following signal (just duplicating your last entry) to make it 4 samples x 3 features, so it is easier to see what is going on with the dimensions.

>>> signal = np.asarray([[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]])
>>> signal.shape
(4, 3)

Next, you want to compute the Jacobian matrix of your softmax function. According to the cited page, it is defined as -hi * hj for the off-diagonal entries (the majority of the matrix for n_features > 2), so let's start there. In numpy, you can efficiently calculate that Jacobian matrix using broadcasting:

>>> J = - signal[..., None] * signal[:, None, :]
>>> J.shape
(4, 3, 3)

The first signal[..., None] (equivalent to signal[:, :, None]) reshapes the signal to (4, 3, 1) while the second signal[:, None, :] reshapes the signal to (4, 1, 3). Then, the * just multiplies both matrices element-wise. Numpy's internal broadcasting repeats both matrices to form the n_features x n_features matrix for every sample.
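If the None-indexing is hard to picture, the same per-sample outer products can be cross-checked with np.einsum (a quick aside of mine, not part of the original answer):

>>> J_check = - np.einsum('ni,nj->nij', signal, signal)  # J_check[n, i, j] = -h_i * h_j for sample n
>>> np.allclose(J_check, J)
True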

Then, we need to fix the diagonal elements:

>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)

The lines above extract the diagonal indices of an n_features x n_features matrix. It is equivalent to doing iy = np.arange(n_features); ix = np.arange(n_features). Then, the diagonal entries are replaced with your definition hi * (1 - hi).
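For the 3-feature case here, those index arrays are simply (a quick check):

>>> np.diag_indices_from(J[0])
(array([0, 1, 2]), array([0, 1, 2]))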

Lastly, according to the linked source, you need to sum across the rows for each of the samples. That can be done as:
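For one sample, the j-th entry of that row sum expands to (my own expansion, for clarity):

$$\sum_i J_{ij} = h_j(1 - h_j) - \sum_{i \neq j} h_i h_j,$$

which is exactly the expression the roll-based code in the question builds up term by term.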

>>> J = J.sum(axis=1)
>>> J.shape
(4, 3)

Find below a summarized version:

if derivative:
    J = - signal[..., None] * signal[:, None, :] # off-diagonal Jacobian
    iy, ix = np.diag_indices_from(J[0])
    J[:, iy, ix] = signal * (1. - signal) # diagonal
    return J.sum(axis=1) # sum across-rows for each sample
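
Dropped back into the asker's softmax_function, the whole thing could look like the sketch below. This is my own assembly rather than the answerer's exact code: it assumes a 2-D input and subtracts the per-row maximum before exponentiating, a common overflow guard the original forward pass does not have.

import numpy as np

def softmax_function( signal, derivative=False ):
    # Calculate activation signal
    signal = np.asarray( signal, dtype=float )
    e_x = np.exp( signal - np.max( signal, axis = 1, keepdims = True ) )  # shift rows for numerical stability
    signal = e_x / np.sum( e_x, axis = 1, keepdims = True )

    if derivative:
        # Return the partial derivative of the activation function (vectorized Jacobian row sums)
        J = - signal[..., None] * signal[:, None, :]   # off-diagonal entries: -h_i * h_j
        iy, ix = np.diag_indices_from( J[0] )
        J[:, iy, ix] = signal * (1. - signal)          # diagonal entries: h_i * (1 - h_i)
        return J.sum( axis = 1 )                       # sum across rows for each sample
    else:
        # Return the activation signal
        return signal
#end activation function

With derivative=True it returns an array of shape (n_samples, n_features), the same shape the sigmoid_function reference returns for its derivative.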


Comparison of the derivatives:

>>> signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
>>> e_x = np.exp( signal )
>>> signal = e_x / np.sum( e_x, axis = 1, keepdims = True )

Yours:

>>> np.multiply( signal, 1 - signal ) + sum(
        # handle the off-diagonal values
        - signal * np.roll( signal, i, axis = 1 )
        for i in xrange(1, signal.shape[1] )
    )
array([[  2.77555756e-17,  -2.77555756e-17,   0.00000000e+00],
       [ -2.77555756e-17,  -2.77555756e-17,  -2.77555756e-17],
       [  2.77555756e-17,   0.00000000e+00,   2.77555756e-17],
       [  2.77555756e-17,   0.00000000e+00,   2.77555756e-17]])

Mine:

>>> J = - signal[..., None] * signal[:, None, :]
>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)
>>> J.sum(axis=1)
array([[  4.16333634e-17,  -1.38777878e-17,   0.00000000e+00],
       [ -2.77555756e-17,  -2.77555756e-17,  -2.77555756e-17],
       [  2.77555756e-17,   1.38777878e-17,   2.77555756e-17],
       [  2.77555756e-17,   1.38777878e-17,   2.77555756e-17]])
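
To confirm the two variants agree beyond eyeballing the printouts, a quick side-by-side check (my addition; range replaces the Python 2 xrange, and signal is the softmax output computed above):

>>> yours = np.multiply( signal, 1 - signal ) + sum(
...     - signal * np.roll( signal, i, axis = 1 )
...     for i in range(1, signal.shape[1] )
... )
>>> mine = J.sum( axis = 1 )
>>> np.allclose( yours, mine, atol = 1e-12 )
True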
