Cloud ML Engine distributed training default type for custom tf.estimator


Question

This article suggests there are three options for distributed training:

1. Data-parallel training with synchronous updates.
2. Data-parallel training with asynchronous updates.
3. Model-parallel training.

The tutorial then goes on to suggest that the code that follows performs data-parallel training with asynchronous updates on Cloud ML Engine, which behaves as "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches."

However, it's not clear what portion of the code actually specifies that this is using data-parallel training with asynchronous updates. Is this simply the default for ML Engine if you run it in distributed training mode with a custom tf.estimator?

Solution

The short answer is that tf.estimator is currently mostly built around Data-parallel training (2).
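
For concreteness, here is a minimal sketch of the kind of custom estimator the article describes (TensorFlow 1.x Estimator API; the toy model, dataset, and GCS path are made up for illustration). Note that nothing in it requests asynchronous updates: when Cloud ML Engine launches the job on several nodes, RunConfig picks up the cluster from TF_CONFIG and each worker applies its gradients to the shared parameters independently.

    import numpy as np
    import tensorflow as tf

    def input_fn():
        # Tiny in-memory dataset just to keep the sketch self-contained.
        x = np.random.rand(256, 4).astype(np.float32)
        y = np.random.randint(0, 3, size=256)
        dataset = tf.data.Dataset.from_tensor_slices(({"x": x}, y))
        return dataset.shuffle(256).repeat().batch(32)

    def model_fn(features, labels, mode, params):
        logits = tf.layers.dense(features["x"], params["n_classes"])
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        optimizer = tf.train.GradientDescentOptimizer(params["learning_rate"])
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    # Nothing below says "asynchronous". In a distributed job each worker
    # computes gradients on its own shard of batches and applies them to the
    # parameter servers without waiting for the others.
    estimator = tf.estimator.Estimator(
        model_fn=model_fn,
        params={"n_classes": 3, "learning_rate": 0.01},
        config=tf.estimator.RunConfig(model_dir="gs://my-bucket/model"))  # hypothetical bucket

    train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=10000)
    eval_spec = tf.estimator.EvalSpec(input_fn=input_fn)
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)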

You get Model-parallel training (3) simply by using with tf.device() statements in your code.
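
As a rough illustration (not from the original answer), a model_fn could pin different layers to different devices; the device strings are placeholders for whatever accelerators your workers actually have.

    import tensorflow as tf

    def model_fn(features, labels, mode, params):
        # First part of the network on one device...
        with tf.device("/device:GPU:0"):
            hidden = tf.layers.dense(features["x"], 512, activation=tf.nn.relu)
        # ...second part on another device: a hand-rolled form of model parallelism.
        with tf.device("/device:GPU:1"):
            logits = tf.layers.dense(hidden, params["n_classes"])
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        train_op = tf.train.AdamOptimizer().minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)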

You could try to use SyncReplicasOptimizer and probably accomplish synchronous training (1).
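
Sketched below is one way that wrapping might look; the replica counts and chief handling are assumptions that may need tuning for a real cluster. Adding config to the model_fn signature makes the estimator pass in its RunConfig.

    import tensorflow as tf

    def model_fn(features, labels, mode, params, config):
        logits = tf.layers.dense(features["x"], params["n_classes"])
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

        optimizer = tf.train.GradientDescentOptimizer(params["learning_rate"])
        # Aggregate gradients from all workers before applying a single update,
        # i.e. synchronous data-parallel training.
        optimizer = tf.train.SyncReplicasOptimizer(
            optimizer,
            replicas_to_aggregate=config.num_worker_replicas,
            total_num_replicas=config.num_worker_replicas)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

        # The wrapped optimizer needs this hook to coordinate the replicas.
        sync_hook = optimizer.make_session_run_hook(is_chief=config.is_chief)
        return tf.estimator.EstimatorSpec(
            mode, loss=loss, train_op=train_op, training_hooks=[sync_hook])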

All of the above applies generally to tf.estimator; nothing is different for Cloud ML Engine.
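
The only Cloud ML Engine-specific piece is the TF_CONFIG environment variable it sets on every VM in the job, which RunConfig parses automatically. A quick way to see this from inside a trainer (a small sketch; the printed values depend on the job's scale tier):

    import json
    import os
    import tensorflow as tf

    # Each node in a Cloud ML Engine job receives a TF_CONFIG describing the cluster.
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    print("task:", tf_config.get("task"))        # e.g. {"type": "worker", "index": 3}
    print("cluster:", tf_config.get("cluster"))  # master / worker / ps host lists

    # RunConfig reads the same variable, so the estimator code needs no changes.
    run_config = tf.estimator.RunConfig()
    print(run_config.task_type, run_config.task_id, run_config.num_worker_replicas)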
