Kernel died, restarting whenever training a model


Problem description


Here's the code:

# import libraries
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense

# import dataset
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator()

test_datagen = ImageDataGenerator()

training_set = train_datagen.flow_from_directory(
                                            'data/spectrogramme/ensemble_de_formation',
                                            target_size = (64, 64),
                                            batch_size = 128,
                                            class_mode = 'binary')

test_set = test_datagen.flow_from_directory('data/spectrogramme/ensemble_de_test',
                                            target_size = (64, 64),
                                            batch_size = 128,
                                            class_mode = 'binary')

# initializing
reseau = Sequential()

# 1. convolution
reseau.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
reseau.add(MaxPooling2D(pool_size = (2, 2)))
reseau.add(Conv2D(32, (3, 3), activation = 'relu'))
reseau.add(MaxPooling2D(pool_size = (2, 2)))
reseau.add(Conv2D(64, (3, 3), activation = 'relu'))
reseau.add(MaxPooling2D(pool_size = (2, 2)))
reseau.add(Conv2D(64, (3, 3), activation = 'relu'))
reseau.add(MaxPooling2D(pool_size = (2, 2)))

# 2. flattening
reseau.add(Flatten())

# 3. fully connected
from keras.layers import Dropout
reseau.add(Dense(units = 64, activation = 'relu'))
reseau.add(Dropout(0.1))
reseau.add(Dense(units = 128, activation = 'relu'))
reseau.add(Dropout(0.05))
reseau.add(Dense(units = 256, activation = 'relu'))
reseau.add(Dropout(0.03))
reseau.add(Dense(units = 1, activation = 'sigmoid'))

# 4. compile
reseau.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# 5. fit
reseau.fit_generator(training_set, steps_per_epoch = 8000, epochs = 1,
                     validation_data = test_set, validation_steps = 2000)

This should prove that I have tensorflow-gpu with CUDA and cuDNN installed (screenshot omitted).

I don't know what to do; I have reinstalled CUDA and cuDNN multiple times.

HOWEVER, if I uninstall tensorflow-gpu, the program runs flawlessly... with the exception of needing 5000 seconds per epoch... I'd like to avoid that

FYI, this is all happening on Windows

Any help is appreciated.

Solution

A very cumbersome issue with tensorflow-gpu. It took me days to find the best working solution.

What seems to be the problem:

I know you might have installed cuDNN and CUDA (just like me) after watching YouTube videos or online documentation. But CUDA and cuDNN are very strict about version compatibility, so it's possible that there is a version mismatch between your TensorFlow, CUDA, and cuDNN versions.
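
If you want to check what is currently installed before reinstalling anything, a quick sanity check along these lines (a minimal sketch using the standard TensorFlow API; run it in a fresh notebook cell) shows which TensorFlow version you have and whether that build was compiled with CUDA support at all:

import tensorflow as tf

# Version of the installed TensorFlow package, e.g. 2.3.0
print(tf.__version__)

# True only if this TensorFlow build was compiled against CUDA
print(tf.test.is_built_with_cuda())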

What's the solution:

The tensorflow build automatically selected by Anaconda on Windows 10 during the installation of tensorflow-gpu 2.3 seems to be faulty. Please find a workaround here (consider upvoting the GitHub answer if you have a GitHub account).

Python 3.7: conda install tensorflow-gpu=2.3 tensorflow=2.3=mkl_py37h936c3e2_0

Python 3.8: conda install tensorflow-gpu=2.3 tensorflow=2.3=mkl_py38h1fcfbd6_0

These commands automatically download the matching CUDA and cuDNN libraries along with tensorflow-gpu. After trying this solution, I was able to fit() TensorFlow models, and training was significantly faster because the GPU was actually being used.
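
To confirm that the new build actually sees the GPU, something like this can be run before training (a minimal sketch; with a working CUDA/cuDNN setup the list should contain at least one GPU device):

import tensorflow as tf

# Lists the GPUs visible to TensorFlow; an empty list means it is falling back to the CPU
print(tf.config.list_physical_devices('GPU'))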

A word of advice:

If you are working in machine learning / data science, I would strongly advise you to switch to Anaconda instead of pip. It lets you create virtual environments and integrates easily with Jupyter notebooks. You can create a separate virtual environment for machine learning tasks, since they often require upgrading or downgrading libraries; inside a virtual environment, those changes won't affect packages outside it.
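
For example, a dedicated environment for this kind of project could be set up roughly like this (a sketch only; the environment name tf23-gpu and the Python version are arbitrary choices, not part of the fix itself):

conda create -n tf23-gpu python=3.8
conda activate tf23-gpu
conda install tensorflow-gpu=2.3 tensorflow=2.3=mkl_py38h1fcfbd6_0
conda install jupyter

Everything installed this way stays inside tf23-gpu, so experimenting with different TensorFlow/CUDA combinations won't break packages in your base environment.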
