Hyperparameter optimization for Deep Learning Structures using Bayesian Optimization


Problem description


I have constructed a CLDNN (Convolutional, LSTM, Deep Neural Network) structure for a raw signal classification task.

Each training epoch runs for about 90 seconds, and the hyperparameters seem to be very difficult to optimize.

I have been researching various ways to optimize the hyperparameters (e.g. random or grid search) and found out about Bayesian Optimization.

Although I still do not fully understand the optimization algorithm, I feel like it will help me greatly.

I would like to ask a few questions regarding the optimization task.

  1. How do I set up the Bayesian Optimization with regards to a deep network? (What is the cost function we are trying to optimize?)
  2. What is the function I am trying to optimize? Is it the cost of the validation set after N epochs?
  3. Is spearmint a good starting point for this task? Any other suggestions for this task?

I would greatly appreciate any insights into this problem.

Solution

Although I still do not fully understand the optimization algorithm, I feel like it will help me greatly.

First up, let me briefly explain this part. Bayesian Optimization methods aim to deal with the exploration-exploitation trade-off in the multi-armed bandit problem. In this problem, there is an unknown function which we can evaluate at any point, but each evaluation costs something (a direct penalty or an opportunity cost), and the goal is to find its maximum using as few trials as possible. Basically, the trade-off is this: you know the function at a finite set of points (of which some are good and some are bad), so you can try an area around the current local maximum, hoping to improve it (exploitation), or you can try a completely new area of the space, which can potentially be much better or much worse (exploration), or somewhere in between.

Bayesian Optimization methods (e.g. PI, EI, UCB) build a model of the target function using a Gaussian Process (GP) and at each step choose the most "promising" point based on that GP model (note that "promising" is defined differently by each particular method).
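To make one of these concrete: the EI (Expected Improvement) acquisition has a closed form in terms of the GP posterior mean and standard deviation. A minimal sketch for a maximization problem; the exploration parameter `xi` is a common addition of mine, not something the answer specifies:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f, xi=0.01):
    """EI acquisition for maximization: expected improvement over best_f
    at points whose GP posterior is N(mu, sigma**2).

    mu, sigma : arrays of posterior means and standard deviations
    best_f    : best objective value observed so far
    xi        : small exploration bonus (assumed, not from the answer)
    """
    sigma = np.maximum(sigma, 1e-9)      # avoid division by zero
    z = (mu - best_f - xi) / sigma       # standardized improvement
    return (mu - best_f - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```

Points with either a high mean (exploitation) or a large standard deviation (exploration) score well, which is exactly the trade-off described above.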

Here's an example:

The true function is f(x) = x * sin(x) (black curve) on the [-10, 10] interval. Red dots represent each trial, the red curve is the GP mean, and the blue curves are the mean plus or minus one standard deviation. As you can see, the GP model doesn't match the true function everywhere, but the optimizer fairly quickly identified the "hot" area around -8 and started to exploit it.
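The whole loop behind such an example can be sketched end to end in plain NumPy. This is a toy illustration under assumptions of my own (UCB acquisition over a fixed grid, RBF kernel with unit length scale and unit prior variance), not the exact setup that produced the plot:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_posterior(X, y, X_query, jitter=1e-4):
    """Posterior mean and std of a zero-mean, unit-variance GP."""
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))
    K_s = rbf_kernel(X, X_query)             # shape (n_obs, n_query)
    A = K_s.T @ np.linalg.inv(K)             # shape (n_query, n_obs)
    mu = A @ y
    var = 1.0 - np.sum(A * K_s.T, axis=1)    # diag of K_ss - K_s^T K^-1 K_s
    return mu, np.sqrt(np.maximum(var, 0.0))

def f(x):
    return x * np.sin(x)   # true maximum is ~7.92, near x = +/-7.98

rng = np.random.default_rng(0)
X = rng.uniform(-10.0, 10.0, size=3)         # three random initial trials
y = f(X)
grid = np.linspace(-10.0, 10.0, 400)

for _ in range(20):
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(mu + 2.0 * sigma)]   # UCB acquisition
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))

print(max(y))   # best value found so far
```

Early iterations mostly explore (sigma dominates in unsampled regions); once a high-value region is found, the UCB scores there exceed the exploration bonus elsewhere and the loop refines around it, mirroring the behaviour described for the plot.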

How do I set up the Bayesian Optimization with regards to a deep network?

In this case, the space is defined by (possibly transformed) hyperparameters, usually a multidimensional unit hypercube.

For example, suppose you have three hyperparameters: a learning rate α in [0.001, 0.01], a regularizer λ in [0.1, 1] (both continuous) and a hidden layer size N in [50..100] (integer). The space for optimization is a 3-dimensional cube [0, 1]*[0, 1]*[0, 1]. Each point (p0, p1, p2) in this cube corresponds to a triple (α, λ, N) via the following transformation:

p0 -> α = 10**(p0-3)
p1 -> λ = 10**(p1-1)
p2 -> N = int(p2*50 + 50)
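The same transformation as a small helper function (the names are mine, the formulas are exactly those above; note the first two map through a log scale, which is why the decade-wide ranges come out right):

```python
def decode(p0, p1, p2):
    """Map a point of the unit cube [0, 1]^3 back to the
    hyperparameter triple (alpha, lambda, N)."""
    alpha = 10 ** (p0 - 3)        # log scale: [0.001, 0.01]
    lam = 10 ** (p1 - 1)          # log scale: [0.1, 1]
    n_hidden = int(p2 * 50 + 50)  # integer:   [50 .. 100]
    return alpha, lam, n_hidden
```

The optimizer then only ever sees the unit cube; `decode` is applied inside the objective function before training starts.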

What is the function I am trying to optimize? Is it the cost of the validation set after N epochs?

Correct, the target function is the neural network's validation accuracy (or equivalently the validation loss, with the sign of the objective flipped). Clearly, each evaluation is expensive, because it requires at least several epochs of training.

Also note that the target function is stochastic, i.e. two evaluations at the same point may differ slightly, but this is not a blocker for Bayesian Optimization, though it obviously increases the uncertainty.
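One common way to tame that stochasticity, at the cost of more compute, is to average several training runs per point. A hypothetical sketch, where `noisy_objective` is a stand-in for a real training run (the quadratic "true accuracy" is invented for illustration, not the author's network):

```python
import numpy as np

def noisy_objective(p, rng):
    # Stand-in for one full training run: a "true" validation accuracy
    # plus run-to-run noise from random init, shuffling, dropout, etc.
    true_acc = 0.9 - (p - 0.5) ** 2
    return true_acc + rng.normal(0.0, 0.02)

def averaged_objective(p, rng, repeats=3):
    # Averaging k independent runs shrinks the noise std by sqrt(k).
    return float(np.mean([noisy_objective(p, rng) for _ in range(repeats)]))
```

Alternatively, the noise can simply be handed to the GP as an observation-noise term, which is how most Bayesian optimization libraries treat it.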

Is spearmint a good starting point for this task? Any other suggestions for this task?

spearmint is a good library; you can definitely work with it. I can also recommend hyperopt.

In my own research, I ended up writing my own tiny library, basically for two reasons: I wanted to code the exact Bayesian method to use (in particular, I found that a portfolio strategy of UCB and PI converged faster than anything else in my case), plus there is another technique that can save up to 50% of training time, called learning curve prediction (the idea is to skip a full learning cycle when the optimizer is confident the model doesn't learn as fast as in other areas). I'm not aware of any library that implements this, so I coded it myself, and in the end it paid off. If you're interested, the code is on GitHub.
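The learning-curve idea can be approximated with a much cruder heuristic: abandon a trial whose partial validation curve trails the best run so far at the same epoch. A sketch of that simplification (the margin rule is my own, not the predictor described above):

```python
def should_abort(partial_curve, best_curve, margin=0.05):
    """Early-stop heuristic for hyperparameter trials.

    partial_curve : validation accuracies of the current trial so far
    best_curve    : validation accuracies of the best completed trial
    margin        : slack before giving up (assumed value)

    Returns True if the current trial's latest accuracy trails the
    best run's accuracy at the same epoch by more than `margin`.
    """
    t = len(partial_curve) - 1
    if t < 1 or t >= len(best_curve):
        return False                      # too early, or no reference yet
    return partial_curve[t] < best_curve[t] - margin
```

With 90-second epochs, cutting hopeless trials after a couple of epochs instead of running them to completion is where savings on the order the answer mentions would come from.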
