Why does keras model predict slower after compile?


Problem Description


In theory, prediction time should be constant, since the weights have a fixed size. How do I get my speed back after compile (without needing to remove the optimizer)?

See associated experiment: https://nbviewer.jupyter.org/github/off99555/TensorFlowExperiments/blob/master/test-prediction-speed-after-compile.ipynb?flush_cache=true

Solution

UPDATE - 1/15/2020: the current best practice for small batch sizes should be to feed inputs to the model directly - i.e. preds = model(x), and if layers behave differently at train / inference, model(x, training=False). Per latest commit, this is now documented.
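For context, a minimal sketch of that direct-call pattern (the toy model below is only a stand-in for whatever model you are serving):

import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# toy model as a stand-in; any built tf.keras Model works the same way
ipt   = Input((16,))
out   = Dense(1, activation='sigmoid')(ipt)
model = Model(ipt, out)

x = np.random.randn(32, 16).astype('float32')

preds = model(x)                  # direct call: skips predict()'s per-call machinery
preds = model(x, training=False)  # explicit inference mode if layers differ at train/inference
preds = preds.numpy()             # the direct call returns a tf.Tensor, not an ndarray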

I haven't benchmarked these, but per the Git discussion, it's also worth trying predict_on_batch() - especially with improvements in TF 2.1.
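The call pattern is simply the following (continuing the sketch above; whether it actually helps will depend on your TF version and setup):

preds = model.predict_on_batch(x)   # runs a single batch, bypassing predict()'s batching machinery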


ULTIMATE CULPRIT: self._experimental_run_tf_function = True. It's experimental. But it's not actually bad.

To any TensorFlow devs reading: clean up your code. It's a mess. And it violates important coding practices, such as one function should do one thing: _process_inputs does a lot more than "process inputs", and the same goes for _standardize_user_data. "I'm not paid enough" - but you do pay, in extra time spent understanding your own code, and in users filling your Issues page with bugs that would be easier to resolve with clearer code.


SUMMARY: it's only a little slower with compile().

compile() sets an internal flag which assigns a different prediction function to predict. This function constructs a new graph upon each call, slowing it down relative to uncompiled. However, the difference is only pronounced when train time is much shorter than data processing time. If we increase the model size to at least mid-sized, the two become equal. See code at the bottom.
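A quick way to see this yourself - a sketch with an intentionally tiny model so that per-call graph construction dominates compute (exact numbers will vary with hardware and TF version):

import numpy as np
from time import time
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ipt   = Input((4,))
model = Model(ipt, Dense(1)(ipt))   # intentionally tiny: overhead dominates compute
x     = np.random.randn(32, 4)

def timeit(func, arg, iterations=100):
    t0 = time()
    for _ in range(iterations):
        func(arg)
    print("%.4f sec" % (time() - t0))

timeit(model.predict, x)            # uncompiled
model.compile('adam', loss='mse')
timeit(model.predict, x)            # compiled - expect this one to be slower here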

This slight increase in data processing time is more than compensated for by the amplified graph capability. Since it is more efficient to keep only one model graph around, the one built before compiling is discarded. Nonetheless: if your model is small relative to the data, you are better off without compile() for model inference. See my other answer for a workaround.


WHAT SHOULD I DO?

Compare model performance compiled vs uncompiled as I have in code at the bottom.

  • If compiled is faster: run predict on the compiled model.
  • If compiled is slower: run predict on an uncompiled model.

Yes, both are possible, and it will depend on (1) data size; (2) model size; (3) hardware. Code at the bottom actually shows compiled model being faster, but 10 iterations is a small sample. See "workarounds" in my other answer for the "how-to".
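One possible workaround in that spirit - a sketch only, not necessarily identical to what the linked answer does - is to keep an uncompiled clone of the trained model purely for inference (assumes a trained tf.keras model and input x as in the earlier sketches):

from tensorflow.keras.models import clone_model

# train on the compiled `model` as usual, then mirror its weights into an
# uncompiled clone and run inference on the clone
inference_model = clone_model(model)              # same architecture, never compiled
inference_model.set_weights(model.get_weights())
preds = inference_model.predict(x)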


DETAILS:

This took a while to debug, but was fun. Below I describe the key culprits I discovered, cite some relevant documentation, and show profiler results that led to the ultimate bottleneck.

(FLAG == self._experimental_run_tf_function, for brevity)

  1. Model by default instantiates with FLAG=False. compile() sets it to True (a quick way to inspect this is sketched right after this list).
  2. predict() involves acquiring the prediction function, func = self._select_training_loop(x)
  3. Without any special kwargs passed to predict and compile, all other flags are such that:
    • (A) FLAG==True --> func = training_v2.Loop()
    • (B) FLAG==False --> func = training_arrays.ArrayLikeTrainingLoop()
  4. From source code docstring, (A) is heavily graph-reliant, uses more distribution strategy, and ops are prone to creating & destroying graph elements, which "may" (do) impact performance.
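The flag itself can be checked directly. A minimal inspection sketch (this is a private attribute, so the exact name is version-dependent; it matches the TF 2.0/2.1-era Keras discussed here):

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ipt   = Input((16,))
model = Model(ipt, Dense(1)(ipt))
print(getattr(model, '_experimental_run_tf_function', None))   # False before compile()
model.compile('adam', loss='mse')
print(getattr(model, '_experimental_run_tf_function', None))   # True after compile()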

True culprit: _process_inputs(), accounting for 81% of runtime. Its major component? _create_graph_function(), 72% of runtime. This method does not even exist for (B). Using a mid-sized model, however, _process_inputs comprises less than 1% of runtime. Code at bottom, and profiling results follow.
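For reference, a generic way to reproduce this kind of profile - a sketch, not necessarily the exact setup used for the results shown below; it assumes model and X as defined in the COUNTEREXAMPLE code at the bottom:

import cProfile, pstats

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    model.predict(X)
profiler.disable()

# sort by cumulative time to surface _process_inputs / _create_graph_function
pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)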


DATA PROCESSORS:

(A): <class 'tensorflow.python.keras.engine.data_adapter.TensorLikeDataAdapter'>, used in _process_inputs(). Relevant source code

(B): numpy.ndarray, returned by convert_eager_tensors_to_numpy. Relevant source code, and here


MODEL EXECUTION FUNCTION (e.g. predict)

(A): distribution function, and here

(B): distribution function (different), and here


PROFILER: results for code in my other answer, "tiny model", and in this answer, "medium model" (screenshots in the original answer; captions only here):

  • Tiny model: 1000 iterations, compile()
  • Tiny model: 1000 iterations, no compile()
  • Medium model: 10 iterations


DOCUMENTATION (indirectly) on effects of compile(): source

Unlike other TensorFlow operations, we don't convert python numerical inputs to tensors. Moreover, a new graph is generated for each distinct python numerical value, for example calling g(2) and g(3) will generate two new graphs

function instantiates a separate graph for every unique set of input shapes and datatypes. For example, the following code snippet will result in three distinct graphs being traced, as each input has a different shape

A single tf.function object might need to map to multiple computation graphs under the hood. This should be visible only as performance (tracing graphs has a nonzero computational and memory cost) but should not affect the correctness of the program
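A minimal illustration of the retracing behavior these passages describe (standard tf.function behavior, not code from the answer):

import tensorflow as tf

@tf.function
def g(x):
    print('tracing')              # printed only when a new graph is traced
    return x * x

g(tf.constant([1.0]))             # traces graph 1: float32, shape (1,)
g(tf.constant([1.0, 2.0]))        # traces graph 2: new input shape
g(2)                              # traces graph 3: python scalars are traced per value
g(3)                              # traces graph 4: another python value, another graph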


COUNTEREXAMPLE:

from tensorflow.keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
import numpy as np
from time import time

def timeit(func, arg, iterations):
    t0 = time()
    for _ in range(iterations):
        func(arg)
    print("%.4f sec" % (time() - t0))

batch_size = 32
batch_shape = (batch_size, 400, 16)
ipt   = Input(batch_shape=batch_shape)
x     = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
x     = LSTM(512, activation='relu', return_sequences=True)(ipt)
x     = Conv1D(128, 400, 1, padding='same')(x)
x     = Flatten()(x)
x     = Dense(256, activation='relu')(x)
x     = Dropout(0.5)(x)
x     = Dense(128, activation='relu')(x)
x     = Dense(64,  activation='relu')(x)
out   = Dense(1,  activation='sigmoid')(x)
model = Model(ipt, out)

X = np.random.randn(*batch_shape)
timeit(model.predict, X, 10)
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 10)

Outputs:

34.8542 sec
34.7435 sec
