Accessing already downloaded dataset with tensorflow_datasets API


Problem description

I am trying to work with the recently published tensorflow_datasets API to train a Keras model on the Open Images Dataset. The dataset is about 570 GB in size. I downloaded the data with the following code:

import tensorflow_datasets as tfds
import tensorflow as tf

# Build the Open Images V4 dataset and download/prepare it on disk.
open_images_dataset = tfds.image.OpenImagesV4()
open_images_dataset.download_and_prepare(download_dir="/notebooks/dataset/")
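
(As an aside, in tfds the download_dir argument only controls where the raw archives are stored; the prepared dataset that tfds.load() later reads is written to the builder's data_dir, which defaults to ~/tensorflow_datasets. A minimal sketch pinning both locations, assuming tfds 1.x; the data_dir path here is illustrative:)

import tensorflow_datasets as tfds

# data_dir: where the prepared dataset (what tfds.load() reads) is written.
# download_dir: where the raw .tar.gz archives land during the download.
builder = tfds.image.OpenImagesV4(data_dir="/notebooks/open_images_dataset/")
builder.download_and_prepare(download_dir="/notebooks/dataset/")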

After the download was complete, the connection to my jupyter notebook was somehow interrupted, but the extraction seemed to have finished as well; at least every downloaded file had a counterpart in the "extracted" folder. However, I am not able to access the downloaded data now:

tfds.load(name="open_images_v4", data_dir="/notebooks/open_images_dataset/extracted/", download=False)

This only gives the following error:

AssertionError: Dataset open_images_v4: could not find data in /notebooks/open_images_dataset/extracted/. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.

When I call the function download_and_prepare(), it just downloads the whole dataset again.

Am I missing something?

After the download, the folder under "extracted" contains 18 .tar.gz files.

Answer

This is with tensorflow-datasets 1.0.1 and tensorflow 2.0.

The folder hierarchy should be like this:

/notebooks/open_images_dataset/extracted/open_images_v4/0.1.0

All datasets are versioned, so the version folder has to be part of the hierarchy. The data can then be loaded like this:

ds = tfds.load('open_images_v4', data_dir='/notebooks/open_images_dataset/extracted', download=False)

I didn't have the open_images_v4 data itself, so I put cifar10 data into a folder named open_images_v4 to check which folder structure tensorflow_datasets expects.
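
To sanity-check the layout before loading, a short check along these lines may help (a sketch only; the paths mirror the question, and 0.1.0 is the version directory from the hierarchy above):

import os
import tensorflow_datasets as tfds

data_dir = "/notebooks/open_images_dataset/extracted"
prepared = os.path.join(data_dir, "open_images_v4", "0.1.0")

if os.path.isdir(prepared):
    # tfds.load() succeeds once data_dir/<name>/<version> exists on disk.
    ds = tfds.load("open_images_v4", data_dir=data_dir, download=False)
else:
    print("No prepared dataset at %s; rerun download_and_prepare()." % prepared)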
