How to speed up "ImageFolder" for ImageNet
Question
I am at a university where all file systems live on a remote system, so I can always access my home directory wherever I log in with my account, including on the GPU servers I reach over SSH. This is the setup in which the GPU servers read my data.
Currently, I am using PyTorch to train ResNet from scratch on ImageNet. My code only uses the GPUs of a single machine, and I found that "torchvision.datasets.ImageFolder" takes almost two hours.
Could you share some experience on how to speed up "torchvision.datasets.ImageFolder"? Thanks very much.
Recommended answer
Why does it take so long?
Setting up an ImageFolder can take a long time, especially when the images are stored on a slow remote disk. The reason for this latency is that the dataset's __init__ function goes over all files in the image folders and checks whether each file is an image file. For ImageNet that can take quite a while, as there are over 1 million files to check.
What can you do?
- As Kevin Sun already pointed out, copying the dataset to local (and possibly much faster) storage can significantly speed things up.
- Alternatively, you can create a modified dataset class that does not scan all the files but relies on a cached list of files: a list that you prepare only once, in advance, and reuse for all runs.
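The second option could look roughly like the sketch below. It is not torchvision code: the names CachedImageFolder, build_index, load_or_build_index, the JSON cache format, and the extension list are all assumptions made for illustration. The slow directory scan runs only when the cache file is missing; every later run just reads the JSON file. The class only implements __len__ and __getitem__, which is what a map-style dataset needs to be passed to a PyTorch DataLoader.

```python
import json
import os

# Assumed list of image extensions; adjust to match your data.
IMG_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".gif", ".tif", ".tiff", ".webp")


def build_index(root):
    """Slow part, meant to run once: list every image under root/<class>/."""
    classes = sorted(e.name for e in os.scandir(root) if e.is_dir())
    samples = []
    for class_idx, cls in enumerate(classes):
        for dirpath, _, filenames in sorted(os.walk(os.path.join(root, cls))):
            for name in sorted(filenames):
                if name.lower().endswith(IMG_EXTENSIONS):
                    # Store paths relative to root so the cache survives
                    # moving the dataset to another mount point.
                    rel = os.path.relpath(os.path.join(dirpath, name), root)
                    samples.append((rel, class_idx))
    return {"classes": classes, "samples": samples}


def load_or_build_index(root, cache_path):
    """Read the cached file list if present, otherwise scan and write it."""
    if os.path.isfile(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    index = build_index(root)
    with open(cache_path, "w") as f:
        json.dump(index, f)
    return index


def pil_loader(path):
    """Default image loader; Pillow is imported lazily so it is only needed here."""
    from PIL import Image
    with open(path, "rb") as f:
        return Image.open(f).convert("RGB")


class CachedImageFolder:
    """Hypothetical ImageFolder-like dataset backed by a cached file list."""

    def __init__(self, root, cache_path, transform=None, loader=pil_loader):
        index = load_or_build_index(root, cache_path)
        self.root = root
        self.classes = index["classes"]
        self.samples = [(path, target) for path, target in index["samples"]]
        self.transform = transform
        self.loader = loader

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        rel_path, target = self.samples[i]
        img = self.loader(os.path.join(self.root, rel_path))
        if self.transform is not None:
            img = self.transform(img)
        return img, target
```

Note that the cache is never refreshed automatically in this sketch: if files are added to or removed from the dataset, delete the cache file so that the list is rebuilt on the next run. Keeping the cache file on fast local storage avoids touching the remote disk at all during setup.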