What is the canonical way to split tf.Dataset into test and validation subsets?

Views: 66 | Published: 2021/9/5 20:17:18 | Tags: python, python-3.x, tensorflow2.0


Question

I was following a TensorFlow 2 tutorial on how to load images with pure TensorFlow, because it is supposed to be faster than with Keras. The tutorial ends before showing how to split the resulting dataset (~tf.Dataset) into a training and a validation dataset.

I checked the reference for tf.Dataset and it does not contain a split() method.

I tried slicing it manually, but tf.Dataset contains neither a size() nor a length() method, so I don't see how I could slice it myself.
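As an aside, more recent TensorFlow releases (2.3 and later, to my knowledge) expose a dataset's size through `cardinality()` and `len()`. Below is a minimal sketch of a positional split built on that, assuming a toy `Dataset.range` in place of the file list. The variable names are illustrative and not from the tutorial.

```python
import tensorflow as tf

# Toy stand-in for list_ds (assumption): 10 elements in a fixed order.
ds = tf.data.Dataset.range(10)

# cardinality() returns a scalar int64 tensor. It can also be
# tf.data.INFINITE_CARDINALITY or tf.data.UNKNOWN_CARDINALITY
# (for example after filter()), so check before relying on it.
n = int(ds.cardinality().numpy())

# Slice by position: the first 70% for training, the rest for validation.
# Integer arithmetic avoids floating-point rounding surprises.
n_train = n * 7 // 10
train_ds = ds.take(n_train)
val_ds = ds.skip(n_train)

train_items = [int(x) for x in train_ds.as_numpy_iterator()]
val_items = [int(x) for x in val_ds.as_numpy_iterator()]
```

A positional split like this only keeps the two subsets disjoint if the element order is the same on every pass. `tf.data.Dataset.list_files` shuffles by default, so it would presumably need `shuffle=False` (or some other stable ordering) before `take`/`skip` are applied.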

I can't use the validation_split argument of Model.fit() because I need to augment the training dataset but not the validation dataset.

What is the intended way to split a tf.Dataset, or should I use a different workflow where I won't have to do this?

(from the tutorial)

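import os
import tensorflow as tf

# data_dir and CLASS_NAMES are defined earlier in the tutorial (not shown here).
AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE in TF 2.4+

BATCH_SIZE = 32
IMG_HEIGHT = 224
IMG_WIDTH = 224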


list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'))


def get_label(file_path):
  # convert the path to a list of path components
  parts = tf.strings.split(file_path, os.path.sep)
  # The second to last is the class-directory
  return parts[-2] == CLASS_NAMES


def decode_img(img):
  # convert the compressed string to a 3D uint8 tensor
  img = tf.image.decode_jpeg(img, channels=3)
  # Use `convert_image_dtype` to convert to floats in the [0,1] range.
  img = tf.image.convert_image_dtype(img, tf.float32)
  # resize the image to the desired size; `size` is [height, width].
  return tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])


def process_path(file_path):
  label = get_label(file_path)
  # load the raw data from the file as a string
  img = tf.io.read_file(file_path)
  img = decode_img(img)
  return img, label


labeled_ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)
#...
#...

I can either split list_ds (list of files) or labeled_ds (list of images and labels), but how?

Recommended answer

I don't think there's a canonical way (typically the data is split ahead of time, e.g. into separate directories). But here's a recipe that will let you do it dynamically:

# Caveat: cache list_ds, otherwise it will perform the directory listing twice.
ds = list_ds.cache()

# Add some indices.
ds = ds.enumerate()

# Do a roughly 70-30 split.
train_list_ds = ds.filter(lambda i, data: i % 10 < 7)
test_list_ds = ds.filter(lambda i, data: i % 10 >= 7)

# Drop indices.
train_list_ds = train_list_ds.map(lambda i, data: data)
test_list_ds = test_list_ds.map(lambda i, data: data)
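
To see what the recipe produces, here is a small self-contained sketch that runs the same enumerate/filter/map steps on a toy dataset and then applies augmentation to the training subset only, which is the requirement from the question. The `Dataset.range` input and the `augment` function are hypothetical stand-ins and not part of the original answer. `cache()` is left out because there is no directory listing to repeat.

```python
import tensorflow as tf

# Toy stand-in for list_ds (assumption): 20 integer "file ids" instead of paths.
ds = tf.data.Dataset.range(20).enumerate()

# Same index-based filters as above: within every block of 10 indices,
# 0-6 go to the training subset and 7-9 go to the held-out subset.
train_ds = ds.filter(lambda i, x: i % 10 < 7).map(lambda i, x: x)
val_ds = ds.filter(lambda i, x: i % 10 >= 7).map(lambda i, x: x)


def augment(x):
    # Hypothetical placeholder for real augmentation (random flips, crops, ...).
    return x + 100


# Augmentation is mapped over the training subset only.
train_ds = train_ds.map(augment)

train_items = sorted(int(x) for x in train_ds.as_numpy_iterator())
val_items = sorted(int(x) for x in val_ds.as_numpy_iterator())
```

Because the split depends only on each element's index, the two subsets stay disjoint as long as both filters see the elements in the same order. That is one more reason to cache (or otherwise fix the order of) a shuffled file list before enumerating it.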
