TFRecordReader seems extremely slow, and multi-threaded reading not working
Question
My training process uses the tfrecord format for the train and eval datasets.
I benchmarked the reader: only 8000 records/second, and the IO speed (seen from the iotop command) is just 400KB-500KB/s.
I'm using the C++ version of protobuf here.
Here is a minimal reproducible example:
import tensorflow as tf  # TF 1.x queue-based input pipeline

def read_and_decode(filename_queue):
    # Read one serialized record at a time from the filename queue.
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    return serialized_example

serialized_example = read_and_decode(filename_queue)
batch_serialized_example = tf.train.shuffle_batch(
    [serialized_example],
    batch_size=batch_size,
    num_threads=thread_number,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue)
features = tf.parse_example(
    batch_serialized_example,
    features={
        "label": tf.FixedLenFeature([], tf.float32),
        "ids": tf.VarLenFeature(tf.int64),
        "values": tf.VarLenFeature(tf.float32),
    })
What other solutions have I tried?
I tried setting num_threads in tf.train.shuffle_batch, but it's not working.
It seems that when set to 2 threads it works at 8000 records/s; when I enlarge the thread number, it gets slower. (I removed all ops that cost CPU. Just reading data.)
My server has 24 CPU cores.
The issue here is that there's a fixed cost overhead to each session.run call, so filling the queue with many tiny examples, one per run, will be slow. In particular, each session.run takes about 100-200 usec, so you can only do about 5k-10k session.run calls per second.
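To make that arithmetic concrete, here is a back-of-envelope model in plain Python (not TensorFlow); the 150 usec figure is an assumed midpoint of the 100-200 usec range, and max_records_per_sec is an illustrative helper, not a TensorFlow API:

```python
# Back-of-envelope model of the fixed session.run overhead (plain Python,
# no TensorFlow). The 100-200 usec per-call cost is the figure quoted above.

def max_records_per_sec(overhead_usec, records_per_run):
    """Upper bound on records/sec when each run costs a fixed overhead."""
    runs_per_sec = 1_000_000 / overhead_usec
    return runs_per_sec * records_per_run

# One record per run, 150 usec assumed overhead: the ceiling lands close
# to the ~8000 records/sec observed in the question.
print(int(max_records_per_sec(150, 1)))    # 6666

# Pushing 100 records per run amortizes the same overhead 100x.
print(int(max_records_per_sec(150, 100)))  # 666666
```

This is only the ceiling imposed by the per-call overhead; real throughput is lower because the queue operations themselves also cost time.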
This problem is obvious when profiling in Python (python -m cProfile), but hard to see if you start from a timeline profile or a CPU profile.
The work-around is to use enqueue_many to add items to your queue in batches. I took your benchmark from https://gist.github.com/ericyue/7705407a88e643f7ab380c6658f641e8 and modified it to enqueue many items per .run call, which gives a 10x speed-up. The modification is to change the tf.train.shuffle_batch call as follows:
if enqueue_many:
    reader = tf.TFRecordReader(options=tf.python_io.TFRecordOptions(
        tf.python_io.TFRecordCompressionType.ZLIB))
    queue_batch = []
    # Read several records per session.run so each run amortizes its
    # fixed overhead over enqueue_many_size examples.
    for i in range(enqueue_many_size):
        _, serialized_example = reader.read(filename_queue)
        queue_batch.append(serialized_example)
    batch_serialized_example = tf.train.shuffle_batch(
        [queue_batch],
        batch_size=batch_size,
        num_threads=thread_number,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        enqueue_many=True)
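Why batching helps can be sketched with a toy cost model in plain Python (no TensorFlow); the per-op overhead and batch size here are illustrative assumptions, not measured values:

```python
# Toy cost model: a fixed per-operation overhead, amortized by enqueue_many.
OVERHEAD_USEC = 150           # assumed fixed cost per queue operation
N_RECORDS = 100_000
ENQUEUE_MANY_SIZE = 100       # records pushed per op in the batched variant

ops_one_by_one = N_RECORDS                     # one op per record
ops_batched = N_RECORDS // ENQUEUE_MANY_SIZE   # one op per 100 records

overhead_one_by_one = ops_one_by_one * OVERHEAD_USEC / 1e6  # seconds
overhead_batched = ops_batched * OVERHEAD_USEC / 1e6

print(f"{overhead_one_by_one:.2f}s vs {overhead_batched:.2f}s")  # 15.00s vs 0.15s
```

In this model the overhead shrinks 100x; the measured speed-up was only about 10x because, once the per-run cost is amortized, other costs (the queue operations themselves) start to dominate.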
For complete source, check here: https://github.com/yaroslavvb/stuff/blob/master/ericyue-slowreader/benchmark.py
It's hard to optimize it to go much faster, since most of the time is now spent in queue operations. Looking at a stripped-down version which just adds integers to a queue, you get similar speed, and looking at the timeline, the time is spent in dequeue ops.
Each dequeue op takes about 60 usec, but there are on average 5 running in parallel, so you get 12 usec per dequeue. That means you'll get <200k examples per second in the best case.
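Plugging those numbers in (plain Python; the figures are the ones quoted above):

```python
# Ceiling implied by the dequeue cost: 60 usec per op, ~5 ops in parallel.
DEQUEUE_USEC = 60
PARALLELISM = 5

effective_usec = DEQUEUE_USEC / PARALLELISM   # 12 usec per dequeued example
ceiling = 1_000_000 / effective_usec          # examples per second
print(int(ceiling))                           # 83333 -- within the <200k bound
```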