How long to train cifar10 in tensorflow with a GTX 960


Question



Could someone else tell me how long it took them to train the model on their machine? I've posted a little bit of the logging information from the code below. top shows ~300% cpu usage for python, and nvidia-smi had been showing Volatile GPU-Util at ~60% yesterday, but now it is about 30%. Started training 30 hours ago, and the loss has been oscillating around 0.10 for about 15 hours now. I might need to tweak the cutoff parameters for the gradient descent, but I expected the code to run and converge as it was in the tutorial repo. I followed the tutorial here, where they say

This model achieves a peak performance of about 86% accuracy within a few hours of training time on a GPU

>>> head -n20 nohup.out
...
2017-05-14 16:38:21.763013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties: 
name: GeForce GTX 960
major: 5 minor: 2 memoryClockRate (GHz) 1.342
pciBusID 0000:01:00.0
Total memory: 1.95GiB
Free memory: 1.58GiB
2017-05-14 16:38:21.763029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 
2017-05-14 16:38:21.763036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y 
2017-05-14 16:38:21.763044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0)

Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
2017-05-14 16:38:36.943404: step 0, loss = 4.68 (83.0 examples/sec; 1.542 sec/batch)
2017-05-14 16:38:37.983802: step 10, loss = 4.60 (1230.3 examples/sec; 0.104 sec/batch)
2017-05-14 16:38:39.199938: step 20, loss = 4.55 (1052.5 examples/sec; 0.122 sec/batch)
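As a sanity check on the tutorial's "a few hours" claim, the throughput in the log above can be turned into a rough time estimate. The batch size of 128 and the 100K-step figure are assumptions taken from the cifar10 tutorial defaults, not from the question itself; the sec/batch figure is read off the step-10/step-20 log lines:

```python
# Back-of-envelope estimate of total training time from the logged throughput.
batch_size = 128       # assumed: the cifar10 tutorial's default batch size
sec_per_batch = 0.11   # from the step-10/step-20 log lines above
max_steps = 100_000    # assumed: the step count the tutorial quotes for ~86%

examples_per_sec = batch_size / sec_per_batch   # ~1160, consistent with the log
hours = max_steps * sec_per_batch / 3600
print(f"~{examples_per_sec:.0f} examples/sec, ~{hours:.1f} hours total")
```

At roughly 0.11 sec/batch this comes out to about 3 hours for 100K steps, so a loss that has been flat for 15 hours suggests training has long since converged and can be stopped.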

Solution

Training runs for as long as you let it. Kill the training script once the loss reaches the value you want; just make sure training has written a checkpoint file recently. For me the checkpoint files were in /tmp/cifar10_train.

First I tried `kill -SIGSTOP <pid>`. As they mentioned in the tutorial, this didn't leave enough memory for the evaluation script, so I terminated the training script with `kill -9 <pid>` instead. Then I ran the evaluation script and got the 86% accuracy they mentioned in the tutorial.
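The distinction matters because SIGSTOP only suspends the process, so it keeps holding its memory, while SIGKILL actually terminates it and frees the resources. A minimal sketch of the difference, using a stand-in `sleep` process rather than the real training script:

```shell
# SIGSTOP suspends a process but it keeps its resources (for a TensorFlow
# process, that includes its GPU memory); SIGKILL terminates it and frees them.
# Demonstrated on a stand-in `sleep`, not the actual cifar10_train.py.
sleep 300 &
pid=$!
kill -STOP "$pid"
sleep 1                                   # give the kernel a moment to stop it
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "state after SIGSTOP: $state"        # 'T' means stopped, still alive
kill -KILL "$pid"                         # actually terminates and frees memory
wait "$pid" 2>/dev/null || true           # reap the killed process
```

The same pattern applies to the training script: a SIGSTOP'd trainer still occupies the GPU, which is why the evaluation script couldn't get enough memory until the process was killed outright.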

2017-05-15 22:39:35.574805: precision @ 1 = 0.865

