Accessing already downloaded dataset with tensorflow_datasets API
Question
I am trying to work with the recently published tensorflow_datasets API to train a Keras model on the Open Images dataset. The dataset is about 570 GB in size. I downloaded the data with the following code:
import tensorflow_datasets as tfds
import tensorflow as tf
open_images_dataset = tfds.image.OpenImagesV4()
open_images_dataset.download_and_prepare(download_dir="/notebooks/dataset/")
After the download was complete, the connection to my Jupyter notebook was somehow interrupted, but the extraction seemed to have finished as well; at least every downloaded file had a counterpart in the "extracted" folder. However, I am not able to access the downloaded data now:
tfds.load(name="open_images_v4", data_dir="/notebooks/open_images_dataset/extracted/", download=False)
This only produces the following error:
AssertionError: Dataset open_images_v4: could not find data in /notebooks/open_images_dataset/extracted/. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.
When I call download_and_prepare(), it just downloads the whole dataset again.
Am I missing something here?
After the download, the folder under "extracted" contains 18 .tar.gz files.
Solution
This is with tensorflow-datasets 1.0.1 and tensorflow 2.0.
The folder hierarchy should look like this:
/notebooks/open_images_dataset/extracted/open_images_v4/0.1.0
All tfds datasets are versioned, so the prepared data must live in a versioned subfolder. The data can then be loaded like this:
ds = tfds.load('open_images_v4', data_dir='/notebooks/open_images_dataset/extracted', download=False)
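The load call works because tfds composes the lookup location as data_dir/&lt;name&gt;/&lt;version&gt;. A minimal sketch of that path resolution (the helper name is my own, for illustration; it is not part of the tfds API):

```python
import os

def expected_dataset_path(data_dir, name, version):
    # tfds looks for prepared data under <data_dir>/<name>/<version>
    return os.path.join(data_dir, name, version)

print(expected_dataset_path(
    "/notebooks/open_images_dataset/extracted", "open_images_v4", "0.1.0"))
```

If this path does not exist on disk, tfds.load raises the AssertionError shown in the question.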
I didn't have the open_images_v4 data, so I put cifar10 data into a folder named open_images_v4 to check what folder structure tensorflow_datasets was expecting.
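As a quick sanity check before calling tfds.load, a small helper can verify that the versioned folder exists and is non-empty (this helper is hypothetical, not part of tfds):

```python
import os

def looks_prepared(data_dir, name, version):
    # The versioned folder must exist and contain the prepared files
    # (TFRecord shards plus metadata) for tfds.load to find the dataset.
    path = os.path.join(data_dir, name, version)
    return os.path.isdir(path) and len(os.listdir(path)) > 0
```

If this returns False for your data_dir, the prepared files are either missing or sitting at the wrong level of the hierarchy.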