TensorFlow - Read video frames from TFRecords file


Problem description

TLDR; my question is how to load compressed video frames from TFRecords.

I am setting up a data pipeline for training deep learning models on a large video dataset (Kinetics). For this I am using TensorFlow, more specifically the tf.data.Dataset and TFRecordDataset structures. As the dataset contains ~300k videos of 10 seconds each, there is a large amount of data to deal with. During training, I want to randomly sample 64 consecutive frames from a video, so fast random sampling is important. To achieve this, a number of data loading scenarios are possible during training:


  1. Sample from Video. Load the videos using ffmpeg or OpenCV and sample frames. Not ideal, as seeking in videos is tricky and decoding video streams is much slower than decoding JPG.
  2. JPG Images. Preprocess the dataset by extracting all video frames as JPG. This generates a huge number of files, which is probably not going to be fast due to random access.
  3. Data Containers. Preprocess the dataset to TFRecords or HDF5 files. Requires more work to get the pipeline ready, but is most likely the fastest of these options.

I have decided to go for option (3) and use TFRecord files to store a preprocessed version of the dataset. However, this is also not as straightforward as it seems, for example:


  1. Compression. Storing the video frames as uncompressed byte data in TFRecords would require a huge amount of disk space. Therefore, I extract all the video frames, apply JPG compression and store the compressed bytes as TFRecords.
  2. Video Data. We are dealing with video, so each example in the TFRecords file will be quite large and contain many video frames (typically 250-300 for 10 seconds of video, depending on the frame rate).
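
To put the disk-space concern in numbers, here is a rough back-of-the-envelope estimate in plain Python. The video count and frames per video come from the figures above; the 224x224 RGB frame size is purely an assumption for illustration, since the question does not state the resized resolution, and raw_bytes_per_video is a hypothetical helper.

```python
# Rough estimate of uncompressed storage for the whole dataset.
NUM_VIDEOS = 300_000                   # from the question (~300k videos)
FRAMES_PER_VIDEO = 300                 # upper end of the 250-300 range above
HEIGHT, WIDTH, CHANNELS = 224, 224, 3  # ASSUMED frame size, not from the question


def raw_bytes_per_video(num_frames, height, width, channels):
    """Size in bytes of one video stored as raw uint8 frames (1 byte per value)."""
    return num_frames * height * width * channels


per_video = raw_bytes_per_video(FRAMES_PER_VIDEO, HEIGHT, WIDTH, CHANNELS)
total = per_video * NUM_VIDEOS

print(f"{per_video / 1e6:.1f} MB per video, {total / 1e12:.1f} TB in total")
```

Under these assumptions the raw frames come to roughly 45 MB per video and about 13.5 TB overall, which is why storing JPG-compressed bytes is attractive.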

I wrote the following code to preprocess the video dataset and write the video frames as TFRecord files (each ~5GB in size):

import cv2
import tensorflow as tf  # TensorFlow 1.x API (tf.python_io, tf.placeholder)

def _int64_feature(value):
    """Wrapper for inserting int64 features into Example proto."""
    if not isinstance(value, list):
        value = [value]
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def _bytes_feature(value):
    """Wrapper for inserting bytes features into Example proto."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


with tf.python_io.TFRecordWriter(output_file) as writer:

  # Read and resize all video frames, np.uint8 of size [N,H,W,3]
  frames = ... 

  features = {}
  features['num_frames']  = _int64_feature(frames.shape[0])
  features['height']      = _int64_feature(frames.shape[1])
  features['width']       = _int64_feature(frames.shape[2])
  features['channels']    = _int64_feature(frames.shape[3])
  features['class_label'] = _int64_feature(example['class_id'])
  features['class_text']  = _bytes_feature(tf.compat.as_bytes(example['class_label']))
  features['filename']    = _bytes_feature(tf.compat.as_bytes(example['video_id']))

  # Compress the frames using JPG and store them as bytes under the keys:
  # 'frames/0000', 'frames/0001', ...
  for i in range(len(frames)):
      ret, buffer = cv2.imencode(".jpg", frames[i])
      features["frames/{:04d}".format(i)] = _bytes_feature(tf.compat.as_bytes(buffer.tobytes()))

  tfrecord_example = tf.train.Example(features=tf.train.Features(feature=features))
  writer.write(tfrecord_example.SerializeToString())

This works fine; the dataset is nicely written as TFRecord files with the frames as compressed JPG bytes. My question is how to read the TFRecord files during training, randomly sample 64 consecutive frames from a video, and decode the JPG images.

According to TensorFlow's documentation on tf.data, we need to do something like:

filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)  # Parse the record into tensors.
dataset = dataset.repeat()  # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})

There are many examples of how to do this with images, and that is quite straightforward. However, for video and random sampling of frames I am stuck. The tf.train.Features object stores the frames as frames/0000, frames/0001, etc. My first question is how to randomly sample a set of consecutive frames from this inside the dataset.map() function. One consideration is that each frame has a variable number of bytes due to JPG compression and needs to be decoded using tf.image.decode_jpeg.

Any help on how best to set up reading video samples from TFRecord files would be appreciated!

Recommended answer

Encoding each frame as a separate feature makes it difficult to select frames dynamically, because the signature of tf.parse_example() (and tf.parse_single_example()) requires that the set of parsed feature names be fixed at graph construction time. However, you could try encoding the frames as a single feature that contains a list of JPEG-encoded strings:

def _bytes_list_feature(values):
    """Wrapper for inserting a list of bytes values into an Example proto."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

with tf.python_io.TFRecordWriter(output_file) as writer:

  # Read and resize all video frames, np.uint8 of size [N,H,W,3]
  frames = ... 

  features = {}
  features['num_frames']  = _int64_feature(frames.shape[0])
  features['height']      = _int64_feature(frames.shape[1])
  features['width']       = _int64_feature(frames.shape[2])
  features['channels']    = _int64_feature(frames.shape[3])
  features['class_label'] = _int64_feature(example['class_id'])
  features['class_text']  = _bytes_feature(tf.compat.as_bytes(example['class_label']))
  features['filename']    = _bytes_feature(tf.compat.as_bytes(example['video_id']))

  # Compress the frames using JPG and store them as a list of strings in 'frames'
  encoded_frames = [tf.compat.as_bytes(cv2.imencode(".jpg", frame)[1].tobytes())
                    for frame in frames]
  features['frames'] = _bytes_list_feature(encoded_frames)

  tfrecord_example = tf.train.Example(features=tf.train.Features(feature=features))
  writer.write(tfrecord_example.SerializeToString())

Once you have done this, it will be possible to slice the frames feature dynamically, using a modified version of your parsing code (https://gist.github.com/tomrunia/7ef5d40639f2ae41fb71d3352a701e4a):

SEQ_NUM_FRAMES = 64  # Number of consecutive frames to sample per video

def decode(serialized_example):
  # Prepare feature list; read encoded JPG images as bytes
  features = dict()
  features["class_label"] = tf.FixedLenFeature((), tf.int64)
  features["frames"] = tf.VarLenFeature(tf.string)
  features["num_frames"] = tf.FixedLenFeature((), tf.int64)

  # Parse into tensors
  parsed_features = tf.parse_single_example(serialized_example, features)

  # Randomly sample offset from the valid range. maxval is exclusive, so add 1
  # to keep the last possible window reachable.
  random_offset = tf.random_uniform(
      shape=(), minval=0,
      maxval=parsed_features["num_frames"] - SEQ_NUM_FRAMES + 1, dtype=tf.int64)

  offsets = tf.range(random_offset, random_offset + SEQ_NUM_FRAMES)

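  # Decode the encoded JPG images. map_fn needs an explicit dtype because the
  # output (uint8 images) differs from the input (int64 offsets).
  images = tf.map_fn(lambda i: tf.image.decode_jpeg(parsed_features["frames"].values[i]),
                     offsets, dtype=tf.uint8)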

  label  = tf.cast(parsed_features["class_label"], tf.int64)

  return images, label

(Note that I haven't been able to run your code, so there may be some small errors, but hopefully it is enough to get you started.)
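
The offset arithmetic in decode can be checked outside TensorFlow. Below is a minimal plain-Python sketch of the same idea, not part of the original code: sample_window is a hypothetical helper, and the ValueError for short videos is an assumption about how you might want to handle them. Like tf.random_uniform with an integer dtype, random.randrange excludes its upper bound, so the bound has to be num_frames - seq_len + 1 if the last window should be reachable.

```python
import random


def sample_window(num_frames, seq_len, rng=random):
    """Return the indices of seq_len consecutive frames, chosen uniformly at random."""
    if num_frames < seq_len:
        # Assumed policy: refuse videos that are shorter than the window.
        raise ValueError("video has fewer frames than the requested window")
    # randrange excludes the upper bound, so +1 keeps the last valid offset reachable.
    offset = rng.randrange(0, num_frames - seq_len + 1)
    return list(range(offset, offset + seq_len))


# Example: a 250-frame video and a 64-frame window, seeded for repeatability.
indices = sample_window(num_frames=250, seq_len=64, rng=random.Random(0))
print(indices[0], indices[-1])
```

A video with exactly 64 frames has only one valid window, starting at offset 0, which is the edge case where an exclusive upper bound without the +1 would leave an empty range.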
