I modified a few layers of an example neural network just to see if I could. What's wrong with it?


Problem description

A simple neural network I found had the layers w1, Relu, and w2. I tried to add a new weight layer in the middle and a second Relu after it, so the layers are now w1, Relu, w_mid, Relu, and w2.
It is much, much slower than the original three-layer network, if it works at all. I'm not sure whether every part is getting a forward pass and whether backprop is working across every part it is supposed to.
The neural network is from this link; it is the third block of code down the page.

This is the code I changed.
Below it is the original.

    import torch
    dtype = torch.float
    device = torch.device("cpu")
    #device = torch.device("cuda:0") # Uncomment this to run on GPU

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = 64, 250, 250, 10

    # Create random input and output data
    x = torch.randn(N, D_in, device=device, dtype=dtype)
    y = torch.randn(N, D_out, device=device, dtype=dtype)

    # Randomly initialize weights
    w1 = torch.randn(D_in, H, device=device, dtype=dtype)
    w_mid = torch.randn(H, H, device=device, dtype=dtype)
    w2 = torch.randn(H, D_out, device=device, dtype=dtype)

    learning_rate = 1e-5
    for t in range(5000):
        # Forward pass: compute predicted y
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        k = h_relu.mm(w_mid)
        k_relu = k.clamp(min=0)
        y_pred = k_relu.mm(w2)


        # Compute and print loss
        loss = (y_pred - y).pow(2).sum().item()
        if t % 1000 == 0:
            print(t, loss)

        # Backprop to compute gradients of w1, mid, and w2 with respect to loss
        grad_y_pred = (y_pred - y) * 2
        grad_w2 = k_relu.t().mm(grad_y_pred)
        grad_k_relu = grad_y_pred.mm(w2.t())
        grad_k = grad_k_relu.clone()
        grad_k[k < 0] = 0
        grad_mid = h_relu.t().mm(grad_k)
        grad_h_relu = grad_k.mm(w1.t())
        grad_h = grad_h_relu.clone()
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)

        # Update weights
        w1 -= learning_rate * grad_w1
        w_mid -= learning_rate * grad_mid
        w2 -= learning_rate * grad_w2  

The loss is:

    0 1904074240.0
    1000 639.4848022460938
    2000 639.4848022460938
    3000 639.4848022460938
    4000 639.4848022460938

This is the original code from the PyTorch website.

    import torch


    dtype = torch.float
    #device = torch.device("cpu")
    device = torch.device("cuda:0") # Uncomment this to run on GPU

    # N is batch size; D_in is input dimension;
    # H is hidden dimension; D_out is output dimension.
    N, D_in, H, D_out = 64, 1000, 100, 10

    # Create random input and output data
    x = torch.randn(N, D_in, device=device, dtype=dtype)
    y = torch.randn(N, D_out, device=device, dtype=dtype)

    # Randomly initialize weights
    w1 = torch.randn(D_in, H, device=device, dtype=dtype)
    w2 = torch.randn(H, D_out, device=device, dtype=dtype)

    learning_rate = 1e-6
    for t in range(500):
        # Forward pass: compute predicted y
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        y_pred = h_relu.mm(w2)

        # Compute and print loss
        loss = (y_pred - y).pow(2).sum().item()
        if t % 100 == 99:
            print(t, loss)

        # Backprop to compute gradients of w1 and w2 with respect to loss
        grad_y_pred = 2.0 * (y_pred - y)
        grad_w2 = h_relu.t().mm(grad_y_pred)
        grad_h_relu = grad_y_pred.mm(w2.t())
        grad_h = grad_h_relu.clone()
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)

        # Update weights using gradient descent
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2

Recommended answer

The gradient of h_relu is not computed correctly.

    grad_h_relu = grad_k.mm(w1.t())

It should be w_mid instead of w1:

    grad_h_relu = grad_k.mm(w_mid.t())

Other than that, the calculations are correct, but you should lower the learning rate: the gradients are very big at the beginning, which makes the weights very large and leads to overflowing values (infinity), which in turn produce NaN losses and gradients. This is known as exploding gradients.
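
One quick way to see this (a diagnostic addition, not part of either code block above) is to print the gradient norms alongside the loss after the gradients are computed inside the modified training loop:

    # Diagnostic only: watch the gradient magnitudes grow when the learning rate is too high.
    # Place after grad_w1, grad_mid and grad_w2 are computed inside the loop above.
    if t % 1000 == 0:
        print(t, loss,
              grad_w1.norm().item(),
              grad_mid.norm().item(),
              grad_w2.norm().item())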

In your example, a learning rate of 1e-8 seems to work.
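
For reference, this is what the training loop of the modified code looks like with both suggestions applied: the grad_h_relu line uses w_mid instead of w1, and the learning rate is lowered from 1e-5 to 1e-8. Everything else is copied unchanged from the questioner's code above.

    learning_rate = 1e-8  # lowered from 1e-5 to avoid exploding gradients
    for t in range(5000):
        # Forward pass: compute predicted y
        h = x.mm(w1)
        h_relu = h.clamp(min=0)
        k = h_relu.mm(w_mid)
        k_relu = k.clamp(min=0)
        y_pred = k_relu.mm(w2)

        # Compute and print loss
        loss = (y_pred - y).pow(2).sum().item()
        if t % 1000 == 0:
            print(t, loss)

        # Backprop to compute gradients of w1, w_mid, and w2 with respect to loss
        grad_y_pred = (y_pred - y) * 2
        grad_w2 = k_relu.t().mm(grad_y_pred)
        grad_k_relu = grad_y_pred.mm(w2.t())
        grad_k = grad_k_relu.clone()
        grad_k[k < 0] = 0
        grad_mid = h_relu.t().mm(grad_k)
        grad_h_relu = grad_k.mm(w_mid.t())  # was w1.t(); this was the bug
        grad_h = grad_h_relu.clone()
        grad_h[h < 0] = 0
        grad_w1 = x.t().mm(grad_h)

        # Update weights
        w1 -= learning_rate * grad_w1
        w_mid -= learning_rate * grad_mid
        w2 -= learning_rate * grad_w2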
