Why is this tensorflow training taking so long?


I'm learning DRL with the book Deep Reinforcement Learning in Action. In chapter 3, they present the simple game Gridworld (instructions here, in the rules section) with the corresponding code in PyTorch.

I've experimented with the code, and it takes less than 3 minutes to train the network to an 89% win rate (it won 89 of 100 games after training).

As an exercise, I have migrated the code to tensorflow. All the code is here.

The problem is that with my tensorflow port it takes nearly 2 hours to train the network to a win rate of 84%. Both versions train using only the CPU (I don't have a GPU).

The training loss figures seem correct, and so does the win rate (we have to take into account that the game is random and can have impossible states). The problem is the performance of the overall process.

I'm doing something terribly wrong, but what?

The main differences are in the training loop. In torch it is this:

        loss_fn = torch.nn.MSELoss()
        learning_rate = 1e-3
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        ....
        Q1 = model(state1_batch) 
        with torch.no_grad():
            Q2 = model2(state2_batch) #B
        
        Y = reward_batch + gamma * ((1-done_batch) * torch.max(Q2,dim=1)[0])
        X = Q1.gather(dim=1,index=action_batch.long().unsqueeze(dim=1)).squeeze()
        loss = loss_fn(X, Y.detach())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

and in the tensorflow version:

        loss_fn = tf.keras.losses.MSE
        learning_rate = 1e-3
        optimizer = tf.keras.optimizers.Adam(learning_rate)
        ...
        Q2 = model2(state2_batch) #B
        with tf.GradientTape() as tape:
            Q1 = model(state1_batch)
            Y = reward_batch + gamma * ((1-done_batch) * tf.math.reduce_max(Q2, axis=1))
            X = [Q1[i][action_batch[i]] for i in range(len(action_batch))]
            loss = loss_fn(X, Y)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

Why is the training taking so long?

Solution

Why is TensorFlow slow

TensorFlow has 2 execution modes: eager execution and graph mode. Since version 2, TensorFlow's default behavior is eager execution. Eager execution is great because it enables you to write code close to how you would write standard Python. It's easier to write, and it's easier to debug. Unfortunately, it's really not as fast as graph mode.

So the idea is, once the function is prototyped in eager mode, to make TensorFlow execute it in graph mode. For that you can use tf.function, which compiles a callable into a TensorFlow graph. Once the function is compiled into a graph, the performance gain is usually quite significant. The recommended approach when developing in TensorFlow is the following:

  • Debug in eager mode, then decorate with @tf.function.
  • Don't rely on Python side effects like object mutation or list appends (see the short sketch after this list).
  • tf.function works best with TensorFlow ops; NumPy and Python calls are converted to constants.
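
To illustrate that second point, here is a minimal, hypothetical sketch (not from the question's code): the Python-level list append runs only when tf.function traces the function, not on every call.

import tensorflow as tf

trace_log = []  # plain Python list, mutated as a side effect

@tf.function
def traced_square(x):
    trace_log.append("traced")  # Python side effect: runs only at trace time
    return x * x

traced_square(tf.constant(2.0))
traced_square(tf.constant(3.0))
print(trace_log)  # ['traced'] -- appended once, even though the function ran twice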

I would add: think about the critical parts of your program, and which ones should be converted to graph mode first. It's usually the parts where you call a model to get a result, and that's where you will see the best improvements.
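
As a rough, standalone illustration (the toy_model below and its layer sizes are assumptions, not the asker's network), wrapping a model in tf.function and calling it repeatedly on a single state shows the kind of speed-up graph mode gives over eager calls:

import time
import numpy as np
import tensorflow as tf

# Hypothetical toy model, roughly the size of the Gridworld Q-network
toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(150, activation="relu"),
    tf.keras.layers.Dense(4),
])
toy_model_graph = tf.function(toy_model)  # same model, compiled to a graph on first call

state = np.random.rand(1, 64).astype(np.float32)

for name, fn in [("eager", toy_model), ("tf.function", toy_model_graph)]:
    fn(state)  # warm-up (the first tf.function call traces the graph)
    start = time.perf_counter()
    for _ in range(1000):
        fn(state)
    print(name, time.perf_counter() - start)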

You can find more information in the following guide:

Applying tf.function to your code

So, there are at least two things you can change in your code to make it run considerably faster:

  1. The first one is to not use model.predict on a small amount of data. The function is designed to work on huge datasets or on generators. (See this comment on GitHub.) Instead, you should call the model directly, and for a performance boost, you can wrap that call in a tf.function.

Model.predict is a top-level API designed for batch-predicting outside of any loops, with the full features of the Keras APIs.

  2. The second one is to make your training step a separate function, and to decorate that function with @tf.function.

So, I would declare the following things before your training loop:

# to call instead of model.predict
model_func = tf.function(model)

def get_train_func(model, model2, loss_fn, optimizer):
    """Wrapper that creates a train step using the two model passed"""
    @tf.function
    def train_func(state1_batch, state2_batch, done_batch, reward_batch, action_batch):
        Q2 = model2(state2_batch) #B
        with tf.GradientTape() as tape:
            Q1 = model(state1_batch)
            Y = reward_batch + gamma * ((1-done_batch) * tf.math.reduce_max(Q2, axis=1))
            # gather is more efficient than a list comprehension, and needed in a tf.function
            X = tf.gather(Q1, action_batch, batch_dims=1)
            loss = loss_fn(X, Y)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    return train_func

# train step is a callable 
train_step = get_train_func(model, model2, loss_fn, optimizer)
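
A note on the tf.gather line above: with batch_dims=1 it picks Q1[i, action_batch[i]] for every row i, which is the TensorFlow equivalent of the gather(...).squeeze() in the torch code. A tiny standalone check with made-up values:

import tensorflow as tf

# Illustrative values only: 3 states, 4 possible actions
Q1 = tf.constant([[0.1, 0.2, 0.3, 0.4],
                  [1.0, 2.0, 3.0, 4.0],
                  [5.0, 6.0, 7.0, 8.0]])
action_batch = tf.constant([3, 0, 2])

X = tf.gather(Q1, action_batch, batch_dims=1)  # one Q-value per row
print(X)  # tf.Tensor([0.4 1.  7. ], shape=(3,), dtype=float32)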

And you can use that function in your training loop:

if len(replay) > batch_size:
    minibatch = random.sample(replay, batch_size)
    state1_batch = np.array([s1 for (s1,a,r,s2,d) in minibatch]).reshape((batch_size, 64))
    action_batch = np.array([a for (s1,a,r,s2,d) in minibatch])   #TODO: possible differences
    reward_batch = np.float32([r for (s1,a,r,s2,d) in minibatch])
    state2_batch = np.array([s2 for (s1,a,r,s2,d) in minibatch]).reshape((batch_size, 64))
    done_batch = np.array([d for (s1,a,r,s2,d) in minibatch]).astype(np.float32)

    loss = train_step(state1_batch, state2_batch, done_batch, reward_batch, action_batch)
    losses.append(loss)
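
And wherever the rest of the loop still calls model.predict on a single state (for example when choosing an action), call the wrapped model_func instead. A hedged fragment — the names state1, qval and action_ are assumptions based on the book's code, not taken verbatim from your notebook:

# instead of: qval = model.predict(state1)   # slow for a single (1, 64) state
qval = model_func(state1)                    # direct, graph-compiled call
action_ = np.argmax(qval)                    # e.g. greedy action selection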

There are other changes that you could make to make your code more TensorFlowesque, but with those modifications, your code takes ~2 minutes on my CPU (with a 97% win rate).
