Distributed Training with tf.estimator resulting in more training steps


Problem description

I am experimenting with distributed training options on Cloud ML Engine and I am observing some peculiar results. I have basically altered the census custom estimator example to contain a slightly different model, and changed the optimizer to AdamOptimizer as the only real change. Based on this other thread, my understanding is that any distributed training should be data-parallel asynchronous training, which would suggest "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." In my experiment, I have ~650k training examples and I am running the following experiments for 1 epoch with a batch size of 128. Given 650k training examples and a batch size of 128, I would expect there to be ~5.1k steps in an epoch. Here is the performance I am seeing for different --scale-tier's:

Not distributed

  • BASIC: 8 steps/sec, 5.1k steps, 11 min wall time
  • BASIC_GPU: 24 steps/sec, 5.1k steps, 3.5 min wall time

Distributed

  • STANDARD_1: 14.5 steps/sec -- 26k steps (26k*128 = ~3.3M, which is way more than the training samples actually in the data), 29 min wall time

  • CUSTOM -- 5 complex_model_m workers, 2 large_model parameter servers: 27 steps/sec, 31k steps (31k*128 = ~3.9M, which is way more than the 650k training samples actually in the data), 20 min wall time

My expectation, based on the article, was that data-parallel distributed training would split up the batches amongst all of the workers, so if I had 5 workers on ~5k batches, then each worker would perform ~1,000 batches. However, the actual behavior I am observing seems closer to EACH of the 5 workers performing 1 epoch themselves. When training in a distributed setting, there are ~6x as many steps taken in an epoch as expected -- I know that the true definition of a step is each time the gradients are updated, but my understanding of data-parallel training is that it would just split up the batches, so there should be the same number of gradient updates -- is there any reason why this behavior would be expected? Would it make sense for there to be more training steps needed in a data-parallel asynchronous distributed environment? Can anybody explain the behavior that I am observing?

Recommended answer

The previous answer did a good job of explaining the performance bottlenecks. Let me explain "epochs" and how TensorFlow processes datasets.

The way distributed training works in TensorFlow is that each worker independently iterates through the entire dataset. It is a common misconception that the training set is partitioned amongst the workers, but this is not the case.

In a typical setup with queues (see this tutorial), each worker creates its own queue. That queue gets filled with a list of all the filenames of all the training files (typically the list is shuffled, and every time the queue is exhausted it gets repopulated and reshuffled). Each file is read instance-by-instance, and the data is parsed, preprocessed, and then fed into another queue where the instances are shuffled and batched. Once the last instance of a file has been read, the next filename is popped off the filename queue. If there are no more files to pop, an "epoch" has completed.
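
For concreteness, here is a minimal sketch of the kind of per-worker, queue-based input pipeline described above, using the TF 1.x queue API. The filenames, feature names, and shapes are placeholders, not anything from the census example:

    import tensorflow as tf

    # Minimal sketch (TF 1.x queue API) of the per-worker input pipeline
    # described above. Feature names and shapes are placeholders.
    def input_fn(filenames, batch_size=128, num_epochs=None):
        # Each worker builds its OWN filename queue; nothing here is shared
        # across workers, which is why every worker reads the full dataset.
        filename_queue = tf.train.string_input_producer(
            filenames, num_epochs=num_epochs, shuffle=True)

        # Read one serialized example at a time from whichever file is current.
        reader = tf.TFRecordReader()
        _, serialized_example = reader.read(filename_queue)

        # Parse (and, if needed, preprocess) the instance.
        features = tf.parse_single_example(
            serialized_example,
            features={
                'x': tf.FixedLenFeature([10], tf.float32),   # placeholder feature
                'label': tf.FixedLenFeature([], tf.int64),
            })

        # A second, also worker-local, queue shuffles and batches the instances.
        x_batch, label_batch = tf.train.shuffle_batch(
            [features['x'], features['label']],
            batch_size=batch_size,
            capacity=10000,
            min_after_dequeue=1000)

        return {'x': x_batch}, label_batch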

The important point here is that all of these queues are by default local -- not shared. So every worker independently repeats the same work -- creating queues with all the files and iterating through the entire dataset. A full epoch, then, is roughly equal to the number of instances in the full dataset * the number of workers. (I'm not sure about your STANDARD_1 result, but the result on CUSTOM means you have your master + 5 workers = 6 workers * 650K examples * (1 batch / 128 examples) = ~31K steps.)
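
Spelling that arithmetic out, with the numbers taken from the question:

    # Every worker (including the master) walks the full dataset, so the
    # global step count scales with the worker count.
    examples = 650_000
    batch_size = 128
    workers = 6                    # master + 5 workers in the CUSTOM tier

    steps_expected_single = examples / batch_size          # ~5.1k, the original expectation
    steps_all_workers = workers * examples / batch_size    # ~30.5k, close to the ~31k observed

    print(round(steps_expected_single), round(steps_all_workers))   # 5078 30469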

FYI, the use of epochs is discouraged for parameterizing distributed training, because it's too confusing and there may even be issues with it in general. Just stick with max_steps.
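
As a sketch of what that looks like with the Estimator API: the toy estimator and input functions below are placeholders, and 5100 is just the single-worker epoch equivalent from the question (650k examples / batch size 128), not a recommendation.

    import tensorflow as tf

    # Placeholder model: a single numeric feature and a linear regressor.
    feature_x = tf.feature_column.numeric_column('x')
    estimator = tf.estimator.LinearRegressor(feature_columns=[feature_x])

    def train_input_fn():
        # Toy in-memory data; repeat() so that max_steps, not epochs, ends training.
        ds = tf.data.Dataset.from_tensor_slices(
            ({'x': [[1.0], [2.0], [3.0]]}, [[1.0], [2.0], [3.0]]))
        return ds.repeat().batch(128)

    def eval_input_fn():
        ds = tf.data.Dataset.from_tensor_slices(({'x': [[1.0]]}, [[1.0]]))
        return ds.batch(1)

    # max_steps bounds the *global* step, so it means the same thing no matter
    # how many workers are contributing updates.
    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=5100)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)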

Note that, as a consequence of TensorFlow's design, "batch size" means the batch size of each worker. But each worker sends updates to the parameter servers at roughly the same rate, so over a time period roughly equivalent to the time needed to process one "batch", the parameters receive updates computed from roughly batch_size * num_workers examples. This is what we call the effective batch size. This in turn has a few practical consequences:

  1. You tend to use smaller batch_sizes, especially if you have a large number of workers, so that the effective batch size stays sane.
  2. As you increase the number of workers, your effective batch size increases, and hence you need to decrease your learning rate, at least when using "vanilla" stochastic gradient descent. Optimizers with adaptive learning rates (such as Adagrad, Adam, etc.) tend to be robust to the initial learning rate, but if you add enough workers you may still need to adjust the learning rate (see the sketch after this list).
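
As an illustration only, one way to keep the effective batch size and the update magnitude in check as workers are added; the 1/num_workers learning-rate rule and all constants below are assumptions for the sketch, not a prescription from this thread:

    import tensorflow as tf

    NUM_WORKERS = 6                  # e.g. master + 5 workers, as in the CUSTOM tier above
    TARGET_EFFECTIVE_BATCH = 768     # assumed budget: examples per "round" of updates
    BASE_LEARNING_RATE = 0.1         # assumed to be tuned for a single worker

    # Shrink the per-worker batch size so the effective batch size stays fixed,
    # and shrink the learning rate as more asynchronous updates arrive.
    per_worker_batch_size = TARGET_EFFECTIVE_BATCH // NUM_WORKERS   # -> 128
    learning_rate = BASE_LEARNING_RATE / NUM_WORKERS                # smaller with more workers

    # "Vanilla" SGD is the case where this adjustment matters most.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)

    # An adaptive optimizer such as tf.train.AdamOptimizer is more forgiving of
    # the initial learning rate, as noted above, but may still need retuning
    # once the worker count gets large.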

You may wonder why TensorFlow handles training data in this fashion. It's because in distributed systems you can't rely on machines being the same speed, or even on them being reliable at all. If you partition the training data into disjoint sets that go to each worker, and then one or more machines are slow relative to the others, or the network goes down on one, etc., your training process will see the data from the "fast"/reliable workers more frequently than the data from the "slow"/unreliable workers. That biases the results towards those instances (or, in extreme cases, ignores the others altogether).
