Tensorflow: failed to create session in server


Question

I developed a model in Keras and trained it quite a few times. On one occasion I forcefully stopped the training, and since then I have been getting the following error:

Traceback (most recent call last):
  File "inception_resnet.py", line 246, in <module>
    callbacks=[checkpoint, saveEpochNumber])   ##
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 2042, in fit_generator
    class_weight=class_weight)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1762, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2270, in __call__
    session = get_session()
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 163, in get_session
    _SESSION = tf.Session(config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1486, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 621, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

So the actual error is

tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Most likely the GPU memory is still occupied. I can't even create a simple TensorFlow session.

I have seen an answer here, but when I execute the following command in the terminal

export CUDA_VISIBLE_DEVICES=''

the model starts training, but without GPU acceleration.

Also, I am training the model on a server where I have no root access, so I can't restart the server or clear the GPU memory as root. What is the solution?

Recommended answer

I found the solution in this question:

nvidia-smi -q

This gives a list of all the processes (and their PIDs) occupying GPU memory. I killed them one by one using

kill -9 PID

Now everything is running smoothly again.
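The two steps above (list the processes holding GPU memory, then signal them) can also be sketched in Python. This is only an illustration under stated assumptions: `nvidia-smi` is on the PATH and supports the `--query-compute-apps` query, and the helper names `parse_gpu_pids`, `list_gpu_pids` and `kill_stale` are hypothetical, not part of any library. It sends SIGTERM by default, which is gentler than the SIGKILL sent by `kill -9`.

```python
import os
import signal
import subprocess


def parse_gpu_pids(csv_text):
    """Parse headerless `pid, used_memory` CSV rows into a list of integer PIDs."""
    pids = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Skip blank or malformed rows; the PID is the first column.
        if fields and fields[0].isdigit():
            pids.append(int(fields[0]))
    return pids


def list_gpu_pids():
    """Ask nvidia-smi for the PIDs of compute processes holding GPU memory."""
    # Assumption: this nvidia-smi build supports --query-compute-apps.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader"],
        universal_newlines=True,
    )
    return parse_gpu_pids(out)


def kill_stale(pids, sig=signal.SIGTERM):
    """Send `sig` to each PID and return the PIDs that were signalled.

    The current process is never signalled. PIDs that no longer exist, or
    that belong to another user (no root access), are silently skipped.
    """
    signalled = []
    for pid in pids:
        if pid == os.getpid():
            continue
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except (ProcessLookupError, PermissionError):
            pass
    return signalled
```

A cautious way to use it is to call `list_gpu_pids()` first, check which PIDs belong to your own interrupted training run, and pass only those to `kill_stale`. If a process ignores SIGTERM, `kill_stale(pids, signal.SIGKILL)` corresponds to `kill -9`.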
