Using feed_dict is more than 5x faster than using dataset API?


Problem description

I created a dataset in TFRecord format for testing. Every entry contains 200 columns, named C1 - C199, each being a list of strings, plus a label column to denote the label. The code to create the data can be found here: https://github.com/codescv/tf-dist/blob/8bb3c44f55939fc66b3727a730c57887113e899c/src/gen_data.py#L25
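
For reference, the columns object used in the code below is not shown in these excerpts. A minimal sketch of how it might be built, assuming one hashed categorical column per string feature (the hash bucket size is a placeholder, not taken from the linked repo):

import tensorflow as tf

# Hypothetical definition of `columns`: one categorical column per string
# feature C1-C199; hash_bucket_size=1000 is a placeholder value.
columns = [
    tf.feature_column.categorical_column_with_hash_bucket(
        'C%d' % i, hash_bucket_size=1000, dtype=tf.string)
    for i in range(1, 200)
]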

Then I used a linear model to train the data. The first approach looks like this:

dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.map(parse_tfrecord, num_parallel_calls=5)  # parse one record at a time
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)  # batch the already-parsed (sparse) features

features, labels = dataset.make_one_shot_iterator().get_next()
logits = tf.feature_column.linear_model(features=features, feature_columns=columns, cols_to_vars=cols_to_vars)
train_op = ...

with tf.Session() as sess:
    sess.run(train_op)
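
The parse_tfrecord function referenced above is not shown in the excerpt. A minimal per-record sketch, assuming it uses the same parse spec as the feeding version further below (the actual function in the repo may differ):

# Hypothetical per-record parser: parses a single serialized Example.
def parse_tfrecord(record):
    features = tf.parse_single_example(
        record,
        features=tf.feature_column.make_parse_example_spec(
            columns + [
                tf.feature_column.numeric_column(
                    'label', dtype=tf.float32, default_value=0)]))
    labels = features.pop('label')
    return features, labels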

The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single.py

When I run the code above, I get 0.85 steps/sec (with a batch size of 1024).
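
A figure like this presumably comes from timing the training loop; a hypothetical sketch of such a measurement for the first approach (not the repo's exact code):

import time

# Hypothetical throughput measurement: time a fixed number of training steps.
with tf.Session() as sess:
    start = time.time()
    num_steps = 100
    for _ in range(num_steps):
        sess.run(train_op)
    print('%.2f steps/sec' % (num_steps / (time.time() - start)))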

In the second approach, I manually fetch batches from the Dataset into Python and then feed them to a placeholder, like this:

example = tf.placeholder(dtype=tf.string, shape=[None])  # a batch of serialized Examples
features = tf.parse_example(example, features=tf.feature_column.make_parse_example_spec(columns + [tf.feature_column.numeric_column('label', dtype=tf.float32, default_value=0)]))
labels = features.pop('label')
train_op = ...

# The Dataset only fetches raw serialized strings; parsing happens in tf.parse_example() above.
dataset = tf.data.TFRecordDataset(data_file).repeat().batch(batch_size)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    data_batch = sess.run(next_batch)
    sess.run(train_op, feed_dict={example: data_batch})
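
In steady state this presumably runs in a loop; a hypothetical sketch (num_steps is illustrative, not from the repo), showing that each training step costs two Session.run() calls, one to fetch the serialized batch and one to feed it:

# Hypothetical training loop for the feeding version.
with tf.Session() as sess:
    for _ in range(num_steps):
        data_batch = sess.run(next_batch)  # fetch a batch of serialized strings
        sess.run(train_op, feed_dict={example: data_batch})  # parse and train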

The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single_feed.py

When I run the code above, I get 5 steps/sec. That is more than 5x faster than the first approach. This is what I do not understand, because in theory the second approach should be slower due to the extra serialization/deserialization of the data batches.

Thanks!

Solution

There is currently (as of TensorFlow 1.9) a performance issue when using tf.data to map and batch tensors that have a large number of features with a small amount of data in each. The issue has two causes:

  1. The dataset.map(parse_tfrecord, ...) transformation will execute O(batch_size * num_columns) small operations to create a batch. By contrast, feeding a tf.placeholder() to tf.parse_example() will execute O(1) operations to create the same batch.

  2. Batching many tf.SparseTensor objects using dataset.batch() is much slower than directly creating the same tf.SparseTensor as the output of tf.parse_example().

Improvements to both these issues are underway and should be available in a future version of TensorFlow. In the meantime, you can improve the performance of the tf.data-based pipeline by switching the order of the dataset.map() and dataset.batch() calls, and by rewriting the dataset.map() function to work on a vector of strings, like the feeding-based version does:

dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.repeat(num_epochs)

# Batch first to create a vector of strings as input to the map(). 
dataset = dataset.batch(batch_size)

def parse_tfrecord_batch(record_batch):
  features = tf.parse_example(
      record_batch,
      features=tf.feature_column.make_parse_example_spec(
          columns + [
              tf.feature_column.numeric_column(
                  'label', dtype=tf.float32, default_value=0)]))
  labels = features.pop('label')
  return features, labels

# NOTE: Parallelism might not be as useful, because the individual map function now does
# more work per invocation, but you might want to experiment with this.
dataset = dataset.map(parse_tfrecord_batch)

# Add a prefetch at the end to pipeline the execution.
dataset = dataset.prefetch(1)

features, labels = dataset.make_one_shot_iterator().get_next()    
# ...


EDIT (2018/6/18): To answer your questions from the comments:

  1. Why is dataset.map(parse_tfrecord, ...) O(batch_size * num_columns), not O(batch_size)? If parsing requires enumeration of the columns, why doesn't parse_example take O(num_columns)?

When you wrap TensorFlow code in a Dataset.map() (or another functional transformation), a constant number of extra operations per output is added to "return" values from the function and (in the case of tf.SparseTensor values) "convert" them to a standard format. When you pass the outputs of tf.parse_example() directly to the input of your model, these operations aren't added. While they are very small operations, executing so many of them can become a bottleneck. (Technically the parsing does take O(batch_size * num_columns) time, but the constants involved in parsing are much smaller than the cost of executing an operation.)

  2. Why do you add a prefetch at the end of the pipeline?

When you're interested in performance, this is almost always the best thing to do, and it should improve the overall performance of your pipeline. For more information about best practices, see the performance guide for tf.data.
