TFRecordReader seems extremely slow, and multi-threaded reading not working


Problem description


My training process uses the tfrecord format for both the training and eval datasets.

I benchmarked the reader: only 8000 records/second, and the I/O speed (as seen from the iotop command) is just 400KB-500KB/s.
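Those two figures are consistent with very small records; a rough back-of-the-envelope check (my own arithmetic, taking 450KB/s as the midpoint of the measured range):

```python
# Implied average record size from the measured throughput.
records_per_sec = 8000
io_bytes_per_sec = 450 * 1024          # midpoint of the 400-500KB/s range
avg_record_bytes = io_bytes_per_sec / records_per_sec
print(round(avg_record_bytes))         # roughly 58 bytes per record
```

So each record is tiny, which means per-record overhead dominates the reading cost.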

I'm using the cpp version of protobuf, as described here:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#protobuf-library-related-issues

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

def read_and_decode(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    return serialized_example

serialized_example = read_and_decode(filename_queue)
batch_serialized_example = tf.train.shuffle_batch(
    [serialized_example],
    batch_size=batch_size,
    num_threads=thread_number,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue)
features = tf.parse_example(
    batch_serialized_example,
    features={
        "label": tf.FixedLenFeature([], tf.float32),
        "ids": tf.VarLenFeature(tf.int64),
        "values": tf.VarLenFeature(tf.float32),
    })

What other attempted solutions have you tried?

I tried setting num_threads in tf.train.shuffle_batch, but it does not help.

It seems that with 2 threads it runs at 8000 records/s, and when I increase the thread count it gets slower. (I removed all ops that cost CPU; the graph just reads data.)

My server has 24 CPU cores.

Solution

The issue here is that there's a fixed cost overhead to each session.run, and filling the queue with many tiny examples will be slow.

In particular, each session.run is about 100-200 usec, so you can only do about 5k-10k session.run calls per second.
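As a quick sanity check on those numbers (a back-of-the-envelope sketch using only the figures quoted above):

```python
# Fixed per-call overhead bounds how many session.run calls fit in one second.
overhead_low_s = 100e-6    # ~100 usec per session.run
overhead_high_s = 200e-6   # ~200 usec per session.run

max_calls_per_sec = 1 / overhead_low_s    # best case
min_calls_per_sec = 1 / overhead_high_s   # worst case
print(round(min_calls_per_sec), round(max_calls_per_sec))  # 5000 10000
```

With one tiny record per run call, that call-rate ceiling is the record-rate ceiling.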

This problem is obvious when doing Python profiling (python -m cProfile), but hard to see starting from a timeline profile or a CPU profile.

The work-around is to use enqueue_many to add items to your queue in batches. I took your benchmark from https://gist.github.com/ericyue/7705407a88e643f7ab380c6658f641e8 and modified it to enqueue many items per .run call, and that gives a 10x speed-up.

The modification is to change the tf.train.shuffle_batch call as follows:

if enqueue_many:
    reader = tf.TFRecordReader(options=tf.python_io.TFRecordOptions(
        tf.python_io.TFRecordCompressionType.ZLIB))
    queue_batch = []
    for i in range(enqueue_many_size):
        _, serialized_example = reader.read(filename_queue)
        queue_batch.append(serialized_example)
    batch_serialized_example = tf.train.shuffle_batch(
        [queue_batch],
        batch_size=batch_size,
        num_threads=thread_number,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        enqueue_many=True)

For complete source, check here: https://github.com/yaroslavvb/stuff/blob/master/ericyue-slowreader/benchmark.py
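The reason batching enqueues helps can be sketched with a simple cost model (the per-run overhead matches the figures above; the per-record cost is an illustrative assumption of mine, not a measurement):

```python
# Toy cost model: fixed per-run overhead amortized over records per run.
overhead_s = 150e-6     # assumed fixed cost per session.run (~100-200 usec)
per_record_s = 5e-6     # assumed marginal cost per record (illustrative)

def records_per_sec(records_per_run):
    time_per_run = overhead_s + per_record_s * records_per_run
    return records_per_run / time_per_run

print(round(records_per_sec(1)))     # one record per run: overhead-dominated
print(round(records_per_sec(100)))   # 100 records per run: roughly an
                                     # order of magnitude more throughput
```

Under these assumptions, throughput at one record per run sits in the few-thousands range seen in the benchmark, and enqueuing 100 records per run raises it by well over 10x, consistent with the observed speed-up.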

It's hard to optimize it to go much faster, since now most of the time is spent in queue operations. Looking at a stripped-down version which just adds integers to a queue, you get similar speed, and looking at the timeline, the time is spent in dequeue ops.

Each dequeue op takes about 60 usec, but there are on average 5 running in parallel, so you get 12 usec per dequeue. That means you'll get <200k examples per second in the best case.

