Why is my GPU getting interrupted when training my data?


Problem description

I spent hours configuring my computer and finally got Python to train on the GPU instead of the CPU. However, for some reason training keeps getting interrupted partway through an epoch, and I cannot finish training my models.

Waiting does not solve the problem, and I cannot interrupt the kernel either. I have tried other people's solutions without much luck.

I can train my model normally on the CPU (at a crawl), but when I switch to the GPU the model trains really fast and then hangs partway through, without completing all the required epochs. My Python kernel then stays stuck in the running state, and I cannot interrupt it unless I kill the whole process from Task Manager.

According to the Task Manager performance history, GPU usage shows a sustained spike during training, which is expected. But when training hangs, GPU activity drops back to 0, even though the kernel indicates that training is still in the middle of an epoch. This happens at random and does not depend on the elapsed time or the number of epochs, although it is more likely to happen the longer I train.

Here is my sequential model.

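# Imports are assumed; the original post omits them.
import numpy
import tensorflow as tf
from tensorflow.keras import utils as np_utils
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout
from tensorflow.keras.layers import BatchNormalization as BatchNorm
from tensorflow.keras.models import Sequential

def prepare_sequences(notes, n_vocab, seq_len):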
    """ Prepare the sequences used by the Neural Network """
    sequence_length = seq_len

    names = sorted(set(item for item in notes))
    note_to_int = dict((note, number) for number, note in enumerate(names))

    network_input = []
    network_output = []

    # create input sequences and the corresponding outputs
    for i in range(0, len(notes) - sequence_length, 1):
        sequence_in = notes[i:i + sequence_length]
        sequence_out = notes[i + sequence_length]
        network_input.append([note_to_int[char] for char in sequence_in])
        network_output.append(note_to_int[sequence_out])

    n_patterns = len(network_input)

    # reshape the input into a format compatible with LSTM layers
    network_input = numpy.reshape(network_input, (n_patterns, sequence_length, 1))
    # normalize input
    network_input = network_input / float(n_vocab)

    network_output = np_utils.to_categorical(network_output)

    return (network_input, network_output)

def create_network(network_input, n_vocab, LSTM_node_count, Dropout_count):
    """ create the structure of the neural network """
    model = Sequential()
    model.add(LSTM(
        LSTM_node_count,
        input_shape=(network_input.shape[1], network_input.shape[2]),
        recurrent_dropout= Dropout_count,
        return_sequences=True
    ))
    model.add(LSTM(
        LSTM_node_count, 
        return_sequences=True, 
        recurrent_dropout= Dropout_count,))
    model.add(LSTM(LSTM_node_count))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(256))
    model.add(Activation('relu'))
    model.add(BatchNorm())
    model.add(Dropout(Dropout_count))
    model.add(Dense(n_vocab))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    return model

def train(model, network_input, network_output, epoch, batchsize):
    """ train the neural network """
    filepath = "trained_weights/" + "weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
    checkpoint = ModelCheckpoint(
        filepath,
        monitor='loss',
        verbose=0,
        save_best_only= True,
        mode='min'
    )
    callbacks_list = [checkpoint]

    model.fit(network_input, 
              network_output, 
              epochs= epoch,
              batch_size= batchsize, 
              callbacks= callbacks_list)

configproto = tf.compat.v1.ConfigProto() 
configproto.gpu_options.allow_growth = True
configproto.gpu_options.polling_inactive_delay_msecs = 10
sess = tf.compat.v1.Session(config=configproto) 
tf.compat.v1.keras.backend.set_session(sess)

During training I also get a warning message that I don't understand.

WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU

C:\Users\David>nvidia-smi
Sun Dec 27 15:56:16 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.89       Driver Version: 460.89       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8    N/A /  N/A |    120MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5496    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A      7372    C+G   ...nputApp\TextInputHost.exe    N/A      |
|    0   N/A  N/A      8268    C+G   ...wekyb3d8bbwe\Music.UI.exe    N/A      |
|    0   N/A  N/A      9420    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     10084    C+G   ...ekyb3d8bbwe\YourPhone.exe    N/A      |
|    0   N/A  N/A     11292    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A     14684    C+G   ...cw5n1h2txyewy\LockApp.exe    N/A      |
+-----------------------------------------------------------------------------+

I am currently using TensorFlow 2.4 and CUDA 11.2.

Recommended answer

You are using recurrent_dropout > 0, which does not meet the requirements for Keras to use the cuDNN LSTM kernel; this is what the warning is reporting, and the affected layers fall back to the generic GPU kernel. Set recurrent_dropout = 0 to resolve the issue.
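As a rough illustration of that advice, the sketch below does two things. First, it checks LSTM constructor arguments against the conditions the Keras documentation lists for the cuDNN kernel. Second, it builds a variant of the question's model with recurrent_dropout left at its default of 0. The names cudnn_blockers and CUDNN_REQUIRED are hypothetical helpers, not part of Keras. The added Dropout layers between the LSTM layers are a substitute of my own choosing; ordinary dropout is not equivalent to recurrent dropout, so treat this as one possible arrangement rather than the answerer's exact fix.

```python
# Constructor-level conditions listed in the Keras LSTM documentation for
# the cuDNN kernel. The values are also the Keras defaults.
CUDNN_REQUIRED = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0,
    "unroll": False,
    "use_bias": True,
}


def cudnn_blockers(**lstm_kwargs):
    """Return the argument names that would force the generic GPU kernel.

    Hypothetical helper: it only inspects constructor arguments. The
    masking/padding and eager-execution conditions are not checked here.
    """
    return [
        name
        for name, required in CUDNN_REQUIRED.items()
        if lstm_kwargs.get(name, required) != required
    ]


def create_network(input_shape, n_vocab, lstm_units, dropout_rate):
    """Variant of the question's model with recurrent_dropout left at 0."""
    # Imported lazily so the checker above works without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.LSTM(lstm_units, return_sequences=True),
        layers.Dropout(dropout_rate),  # assumption: ordinary dropout instead
        layers.LSTM(lstm_units, return_sequences=True),
        layers.Dropout(dropout_rate),
        layers.LSTM(lstm_units),
        layers.BatchNormalization(),
        layers.Dropout(dropout_rate),
        layers.Dense(256, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(dropout_rate),
        layers.Dense(n_vocab, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
    return model


# The question's layers pass a non-zero recurrent_dropout, which is flagged.
print(cudnn_blockers(recurrent_dropout=0.3))
# With default arguments nothing is flagged.
print(cudnn_blockers())
```

With a model built this way, the "will not use cuDNN kernel" warnings should no longer appear for the LSTM layers, provided the remaining documented conditions (no unsupported masking, eager execution in the outermost context) also hold.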
