What does "learning rate warm-up" mean?

Question

In machine learning, especially deep learning, what does it mean to warm-up?

I've heard a few times that in some models, warm-up is a phase in training. But honestly, I don't know what it is because I'm very new to ML. Until now I've never used or come across it, but I want to know about it because I think it might be useful for me. So:

What is learning rate warm-up and when do we need it?

Thanks in advance.

Answer

If your data set is highly differentiated, you can suffer from a sort of "early over-fitting". If your shuffled data happens to include a cluster of related, strongly-featured observations, your model's initial training can skew badly toward those features -- or worse, toward incidental features that aren't truly related to the topic at all.

Warm-up is a way to reduce the primacy effect of the early training examples. Without it, you may need to run a few extra epochs to get the convergence desired, as the model un-trains those early superstitions.

Many models afford this as a command-line option. The learning rate is increased linearly over the warm-up period. If the target learning rate is p and the warm-up period is n, then the first batch iteration uses 1*p/n for its learning rate; the second uses 2*p/n, and so on: iteration i uses i*p/n, until we hit the nominal rate at iteration n.
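
As a concrete sketch of that schedule (the function name warmup_lr and the specific numbers are illustrative assumptions, not part of any particular model's interface):

    # Linear warm-up as described above: 1-based iteration i uses i * p / n,
    # then the rate holds at the target p from iteration n onward.
    def warmup_lr(iteration, target_lr, warmup_steps):
        """Return the learning rate for a 1-based batch iteration."""
        if iteration < warmup_steps:
            return iteration * target_lr / warmup_steps
        return target_lr

    # Example with target rate p = 0.1 and warm-up period n = 100:
    for i in (1, 2, 50, 100, 200):
        print(i, warmup_lr(i, target_lr=0.1, warmup_steps=100))
    # prints 0.001, 0.002, 0.05, 0.1, 0.1 -- iteration 1 sees only p/100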

This means that the first iteration gets only 1/n of the primacy effect. This does a reasonable job of balancing that influence.

Note that the ramp-up is commonly on the order of one epoch -- but is occasionally longer for particularly skewed data, or shorter for more homogeneous distributions. You may want to adjust, depending on how functionally extreme your batches can become when the shuffling algorithm is applied to the training set.
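
If you want to wire this into an actual training loop, one hedged sketch uses PyTorch's LambdaLR multiplicative scheduler; the toy model, the warmup_steps value, and the choice of PyTorch itself are assumptions for illustration, not something the answer above prescribes:

    import torch

    model = torch.nn.Linear(10, 1)                            # toy model, illustration only
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # lr is the target rate p

    # Warm up over roughly one epoch's worth of batches, per the note above.
    # 500 is a placeholder; in practice use len(train_loader) or similar.
    warmup_steps = 500
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: min((step + 1) / warmup_steps, 1.0),  # linear ramp, then flat
    )

    # Inside the training loop, after each batch:
    #     optimizer.step()
    #     scheduler.step()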
