TensorFlow - Read video frames from TFRecords file

Question

TL;DR: my question is how to load compressed video frames from TFRecords.

I am setting up a data pipeline for training deep learning models on a large video dataset (Kinetics). For this I am using TensorFlow, more specifically the tf.data.Dataset and TFRecordDataset structures. As the dataset contains ~300k videos of 10 seconds each, there is a large amount of data to deal with. During training, I want to randomly sample 64 consecutive frames from a video, so fast random sampling is important. To achieve this, a number of data loading scenarios are possible during training:

  1. Sample from Video. Load the videos using ffmpeg or OpenCV and sample frames. Not ideal, as seeking in videos is tricky and decoding video streams is much slower than decoding JPG.
  2. JPG Images. Preprocess the dataset by extracting all video frames as JPG. This generates a huge number of files, which is probably not going to be fast due to random access.
  3. Data Containers. Preprocess the dataset to TFRecords or HDF5 files. Requires more work to get the pipeline ready, but is most likely the fastest of these options.

I have decided to go for option (3) and use TFRecord files to store a preprocessed version of the dataset. However, this is also not as straightforward as it seems, for example:

  1. Compression. Storing the video frames as uncompressed byte data in TFRecords will require a huge amount of disk space. Therefore, I extract all the video frames, apply JPG compression and store the compressed bytes as TFRecords.
  2. Video Data. We are dealing with video, so each example in the TFRecords file will be quite large and contain many video frames (typically 250-300 for 10 seconds of video, depending on the frame rate).
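
To put the compression point in numbers, here is a rough back-of-the-envelope estimate of the uncompressed size. The 224x224 frame size is purely an assumption for illustration; the actual resized resolution is not stated here. The frame and video counts are taken from the figures above.

```python
# Rough estimate of uncompressed storage for the dataset described above.
# ASSUMPTION: frames are resized to 224x224 RGB; the real size is not stated.
FRAME_HEIGHT = 224      # assumed
FRAME_WIDTH = 224       # assumed
CHANNELS = 3            # RGB, one byte per channel (np.uint8)
FRAMES_PER_VIDEO = 300  # upper end of the 250-300 range mentioned above
NUM_VIDEOS = 300_000    # "~300k videos"

bytes_per_frame = FRAME_HEIGHT * FRAME_WIDTH * CHANNELS
bytes_per_video = bytes_per_frame * FRAMES_PER_VIDEO
bytes_total = bytes_per_video * NUM_VIDEOS

print(f"per frame: {bytes_per_frame / 1e3:.1f} kB")
print(f"per video: {bytes_per_video / 1e6:.1f} MB")
print(f"dataset:   {bytes_total / 1e12:.2f} TB")
```

Under these assumptions a single video is roughly 45 MB uncompressed and the whole dataset about 13.5 TB, which is why storing JPG-compressed bytes looks attractive.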

I wrote the following code to preprocess the video dataset and write the video frames to TFRecord files (each ~5GB in size):

import cv2
import tensorflow as tf  # written against the TensorFlow 1.x API

def _int64_feature(value):
    """Wrapper for inserting int64 features into Example proto."""
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def _bytes_feature(value):
    """Wrapper for inserting bytes features into Example proto."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


with tf.python_io.TFRecordWriter(output_file) as writer:

  # Read and resize all video frames, np.uint8 of size [N,H,W,3]
  frames = ... 

  features = {}
  features['num_frames']  = _int64_feature(frames.shape[0])
  features['height']      = _int64_feature(frames.shape[1])
  features['width']       = _int64_feature(frames.shape[2])
  features['channels']    = _int64_feature(frames.shape[3])
  features['class_label'] = _int64_feature(example['class_id'])
  features['class_text']  = _bytes_feature(tf.compat.as_bytes(example['class_label']))
  features['filename']    = _bytes_feature(tf.compat.as_bytes(example['video_id']))

  # Compress the frames using JPG and store them as bytes in:
  # 'frames/0000', 'frames/0001', ...
  for i in range(len(frames)):
      ret, buffer = cv2.imencode(".jpg", frames[i])
      features["frames/{:04d}".format(i)] = _bytes_feature(tf.compat.as_bytes(buffer.tobytes()))

  tfrecord_example = tf.train.Example(features=tf.train.Features(feature=features))
  writer.write(tfrecord_example.SerializeToString())

This works fine; the dataset is nicely written as TFRecord files with the frames as compressed JPG bytes. My question is how to read the TFRecord files during training, randomly sample 64 frames from a video, and decode the JPG images.

According to TensorFlow's documentation on tf.data, we need to do something like:

filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)  # Parse the record into tensors.
dataset = dataset.repeat()  # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})

There are many examples of how to do this with images, and that is quite straightforward. However, for video and random sampling of frames I am stuck. The tf.train.Features object stores the frames as frames/0000, frames/0001, etc. My first question is how to randomly sample a set of consecutive frames from this inside the dataset.map() function. One consideration is that each frame has a variable number of bytes due to JPG compression and needs to be decoded using tf.image.decode_jpeg.

Any help on how best to set up reading video samples from TFRecord files would be appreciated!

Recommended answer

Encoding each frame as a separate feature makes it difficult to select frames dynamically, because the signature of tf.parse_example() (and tf.parse_single_example()) requires that the set of parsed feature names be fixed at graph construction time. However, you could try encoding the frames as a single feature that contains a list of JPEG-encoded strings:

def _bytes_list_feature(values):
    """Wrapper for inserting bytes features into Example proto."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

with tf.python_io.TFRecordWriter(output_file) as writer:

  # Read and resize all video frames, np.uint8 of size [N,H,W,3]
  frames = ... 

  features = {}
  features['num_frames']  = _int64_feature(frames.shape[0])
  features['height']      = _int64_feature(frames.shape[1])
  features['width']       = _int64_feature(frames.shape[2])
  features['channels']    = _int64_feature(frames.shape[3])
  features['class_label'] = _int64_feature(example['class_id'])
  features['class_text']  = _bytes_feature(tf.compat.as_bytes(example['class_label']))
  features['filename']    = _bytes_feature(tf.compat.as_bytes(example['video_id']))

  # Compress the frames using JPG and store them as a list of strings in 'frames'
  encoded_frames = [tf.compat.as_bytes(cv2.imencode(".jpg", frame)[1].tobytes())
                    for frame in frames]
  features['frames'] = _bytes_list_feature(encoded_frames)

  tfrecord_example = tf.train.Example(features=tf.train.Features(feature=features))
  writer.write(tfrecord_example.SerializeToString())

Once you have done this, it will be possible to slice the frames feature dynamically, using a modified version of your parsing code:

SEQ_NUM_FRAMES = 64  # Number of consecutive frames to sample per video

def decode(serialized_example):
  # Prepare feature list; read encoded JPG images as bytes
  features = dict()
  features["class_label"] = tf.FixedLenFeature((), tf.int64)
  features["frames"] = tf.VarLenFeature(tf.string)
  features["num_frames"] = tf.FixedLenFeature((), tf.int64)

  # Parse into tensors
  parsed_features = tf.parse_single_example(serialized_example, features)

  # Randomly sample offset from the valid range. maxval is exclusive, so add 1
  # to allow the last valid window (assumes num_frames >= SEQ_NUM_FRAMES).
  random_offset = tf.random_uniform(
      shape=(), minval=0,
      maxval=parsed_features["num_frames"] - SEQ_NUM_FRAMES + 1, dtype=tf.int64)

  offsets = tf.range(random_offset, random_offset + SEQ_NUM_FRAMES)

  # Decode the encoded JPG images. dtype is required because the output
  # (uint8 images) differs in type from the input (int64 offsets).
  images = tf.map_fn(lambda i: tf.image.decode_jpeg(parsed_features["frames"].values[i]),
                     offsets, dtype=tf.uint8)

  label  = tf.cast(parsed_features["class_label"], tf.int64)

  return images, label

(Note that I haven't been able to run your code, so there may be some small errors, but hopefully it is enough to get you started.)
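
The offset arithmetic is easy to get wrong by one, so it may help to sanity-check it outside of TensorFlow first. Below is a minimal plain-Python sketch of picking a window of consecutive frame indices. `sample_window` is a hypothetical helper written only for this illustration and is not part of any library. Note that `tf.random_uniform` treats an integer `maxval` as exclusive, whereas `random.randint` used here includes both endpoints.

```python
import random

SEQ_NUM_FRAMES = 64  # window length from the question

def sample_window(num_frames, seq_len=SEQ_NUM_FRAMES, rng=random):
    """Return indices of `seq_len` consecutive frames chosen uniformly at random.

    Hypothetical helper for illustration only.
    """
    if num_frames < seq_len:
        raise ValueError("video has fewer frames than the requested window")
    # Valid start offsets are 0 .. num_frames - seq_len, inclusive.
    start = rng.randint(0, num_frames - seq_len)  # randint includes both ends
    return list(range(start, start + seq_len))

# Example: a 250-frame video yields 64 consecutive indices within [0, 249].
indices = sample_window(num_frames=250)
print(indices[0], indices[-1])
```

A video with exactly 64 frames has a single valid window starting at offset 0, and shorter videos need separate handling (for example skipping or padding), which this sketch only signals with an exception.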
