Merge weights of same model trained on 2 different computers using tensorflow


Problem description


I was doing some research on training deep neural networks using TensorFlow. I know how to train a model. My problem is that I have to train the same model on 2 different computers with different datasets, and then save the model weights. Later I have to merge the 2 model weight files somehow. I have no idea how to merge them. Is there a function that does this, or should the weights be averaged?


Any help on this problem would be useful.

Thanks in advance

Recommended answer


It is better to merge weight updates (gradients) during training and keep a common set of weights, rather than trying to merge the weights after the individual trainings have completed. Two separately trained networks may each find a different optimum, so e.g. averaging the weights may give a network that performs worse on both datasets.

There are two things you can do:

  1. Look at "data-parallel training": distributing the forward and backward passes of the training process over multiple compute nodes, each of which has a subset of the entire data.

Typically in this case:

  • each node propagates a minibatch forward through the network
  • each node propagates the loss gradient backward through the network
  • a "master node" collects the gradients from the minibatches on all nodes and updates the weights accordingly
  • and distributes the weight updates back to the compute nodes, to make sure each node has the same set of weights


(There are variants of the above to avoid compute nodes idling too long while waiting for results from others.) The above assumes that the TensorFlow processes running on the compute nodes can communicate with each other during training.


Look at https://www.tensorflow.org/deploy/distributed for more details and an example of how to train networks over multiple nodes.
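The synchronous data-parallel loop described above can be sketched in a few lines. This is a minimal NumPy simulation, not the TensorFlow distributed API: `worker_gradient` and `master_step` are hypothetical names, and the two "nodes" are just Python function calls, but the structure (each worker computes a gradient on its own shard, the master averages them and updates one common weight set) is the one the bullets describe.

```python
import numpy as np

def worker_gradient(w, X, y):
    """One compute node: forward pass (predictions) and backward pass
    (gradient of mean squared error w.r.t. the weights) on its own shard."""
    preds = X @ w                      # forward propagation
    err = preds - y
    return 2.0 * X.T @ err / len(y)    # backward propagation: dMSE/dw

def master_step(w, shard_grads, lr=0.1):
    """'Master node': average the gradients collected from all workers and
    apply one update to the common weights, which every node then shares."""
    avg_grad = np.mean(shard_grads, axis=0)
    return w - lr * avg_grad

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])

# Two machines, each with its own dataset (shard).
shards = []
for _ in range(2):
    X = rng.normal(size=(64, 2))
    y = X @ true_w
    shards.append((X, y))

w = np.zeros(2)                        # common initial weights on all nodes
for _ in range(200):                   # synchronous training loop
    grads = [worker_gradient(w, X, y) for X, y in shards]
    w = master_step(w, grads)          # every node ends up with the same w

print(w)                               # converges toward true_w
```

Because both shards contribute to every update, there is only ever one set of weights, so the "merge two weight files afterwards" problem never arises.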

  2. If you really have to train the networks separately, look at ensembling, see e.g. this page: https://mlwave.com/kaggle-ensembling-guide/ . In a nutshell, you would train the individual networks on their own machines and then e.g. use an average or maximum over the outputs of both networks as a combined classifier / predictor.
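The output-level combination suggested above can be sketched as follows. This is an illustrative helper (the `ensemble` function and its `mode` argument are made up for this example, not a library API): given the class-probability outputs of the two separately trained networks, it averages (or takes the elementwise maximum of) the probabilities and picks the winning class per sample.

```python
import numpy as np

def ensemble(prob_a, prob_b, mode="average"):
    """Combine class-probability outputs of two separately trained networks
    into a single prediction (hypothetical helper, not a TF API)."""
    if mode == "average":
        combined = (prob_a + prob_b) / 2.0
    elif mode == "maximum":
        combined = np.maximum(prob_a, prob_b)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return combined.argmax(axis=1)     # predicted class per sample

# Softmax outputs for 3 samples / 2 classes from each network.
net1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
net2 = np.array([[0.6, 0.4], [0.2, 0.8], [0.1, 0.9]])

print(ensemble(net1, net2, "average"))   # -> [0 1 1]
```

Note that this merges predictions, not weights: both weight files are kept and both networks are run at inference time, which sidesteps the problem of averaged weights landing between two different optima.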
