Why does keras model predict slower after compile?

Question

In theory, the prediction should be constant as the weights have a fixed size. How do I get my speed back after compile (without the need to remove optimizer)?

See associated experiment: https://nbviewer.jupyter.org/github/off99555/TensorFlowExperiments/blob/master/test-prediction-speed-after-compile.ipynb?flush_cache=true

Solution

UPDATE - 1/15/2020: the current best practice for small batch sizes should be to feed inputs to the model directly - i.e. preds = model(x), and if layers behave differently at train / inference, model(x, training=False). Per latest commit, this is now documented.

I haven't benchmarked these, but per the Git discussion, it's also worth trying predict_on_batch() - especially with improvements in TF 2.1.
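
As a minimal sketch of the three inference paths mentioned above, assuming TensorFlow 2.x is installed: the toy model, its layer sizes, and the input shape are made up for illustration and are not taken from the question.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy model; sizes are arbitrary and chosen only to keep this quick.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # behaves differently at train vs inference
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
x = np.random.randn(4, 16).astype("float32")

# 1) Direct call: returns a tensor; training=False disables Dropout.
preds_call = model(x, training=False).numpy()

# 2) Single-batch helper: returns a NumPy array.
preds_batch = model.predict_on_batch(x)

# 3) Full predict() loop, which goes through extra input handling on each call.
preds_predict = model.predict(x, verbose=0)

print(preds_call.shape, preds_batch.shape, preds_predict.shape)
```

All three should agree numerically; which one is fastest depends on batch size, model size, and hardware, so it is worth timing them on your own setup.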


ULTIMATE CULPRIT: self._experimental_run_tf_function = True. It's experimental. But it's not actually bad.

To any TensorFlow devs reading: clean up your code. It's a mess. And it violates important coding practices, such as "one function does one thing"; _process_inputs does a lot more than "process inputs", same for _standardize_user_data. "I'm not paid enough" - but you do pay, in extra time spent understanding your own stuff, and in users filling your Issues page with bugs that clearer code would make easier to resolve.


SUMMARY: it's only a little slower with compile().

compile() sets an internal flag which assigns a different prediction function to predict. This function constructs a new graph upon each call, slowing it down relative to uncompiled. However, the difference is only pronounced when the model's compute time is much shorter than the data processing time. If we increase the model size to at least mid-sized, the two become equal. See code at the bottom.

This slight increase in data processing time is more than compensated by amplified graph capability. Since it's more efficient to keep only one model graph around, the pre-compile graph is discarded. Nonetheless: if your model is small relative to data, you are better off without compile() for model inference. See my other answer for a workaround.


WHAT SHOULD I DO?

Compare model performance compiled vs uncompiled as I have in code at the bottom.

  • Compiled is faster: run predict on a compiled model.
  • Compiled is slower: run predict on an uncompiled model.

Yes, both are possible, and it will depend on (1) data size; (2) model size; (3) hardware. Code at the bottom actually shows compiled model being faster, but 10 iterations is a small sample. See "workarounds" in my other answer for the "how-to".
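
Since 10 iterations is a small sample, a slightly sturdier comparison can help. The helper below is a sketch using only the standard library; benchmark and pick_faster are illustrative names (not part of Keras), and the two lambdas are stand-ins for the prediction callables you would actually pass.

```python
from statistics import median
from time import perf_counter

def benchmark(func, arg, iterations=10, repeats=5, warmup=1):
    """Return the median wall-clock seconds for `iterations` calls of func(arg)."""
    for _ in range(warmup):
        func(arg)  # warm-up: absorbs one-off costs before timing starts
    totals = []
    for _ in range(repeats):
        t0 = perf_counter()
        for _ in range(iterations):
            func(arg)
        totals.append(perf_counter() - t0)
    return median(totals)

def pick_faster(candidates, arg, **kwargs):
    """Time each named callable; return (name_of_fastest, {name: seconds})."""
    timings = {name: benchmark(f, arg, **kwargs) for name, f in candidates.items()}
    return min(timings, key=timings.get), timings

# Stand-in workloads (assumptions for the demo, not real models).
fast = lambda n: sum(range(n))
slow = lambda n: sum(range(n * 50))

best, timings = pick_faster({"fast": fast, "slow": slow}, 2000)
print(best, timings)
```

With real models, the candidates would be the predict method of an uncompiled model and of a separately built, compiled copy, both called with the same input array.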


DETAILS:

This took a while to debug, but was fun. Below I describe the key culprits I discovered, cite some relevant documentation, and show profiler results that led to the ultimate bottleneck.

(FLAG == self._experimental_run_tf_function, for brevity)

  1. Model by default instantiates with FLAG=False. compile() sets it to True.
  2. predict() involves acquiring the prediction function, func = self._select_training_loop(x)
  3. Without any special kwargs passed to predict and compile, all other flags are such that:
    • (A) FLAG==True --> func = training_v2.Loop()
    • (B) FLAG==False --> func = training_arrays.ArrayLikeTrainingLoop()
  4. From source code docstring, (A) is heavily graph-reliant, uses more distribution strategy, and ops are prone to creating & destroying graph elements, which "may" (do) impact performance.

True culprit: _process_inputs(), accounting for 81% of runtime. Its major component? _create_graph_function(), 72% of runtime. This method does not even exist for (B). Using a mid-sized model, however, _process_inputs comprises less than 1% of runtime. Code at bottom, and profiling results follow.
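
To put those percentages in perspective, here is a back-of-the-envelope bound. It assumes, optimistically, that the whole _process_inputs share could be eliminated and that nothing else changes, so it is an upper limit and not a measured result; max_speedup is an illustrative name.

```python
def max_speedup(overhead_fraction):
    """Upper bound on speedup if a stage taking this fraction of runtime cost nothing."""
    if not 0.0 <= overhead_fraction < 1.0:
        raise ValueError("overhead_fraction must be in [0, 1)")
    return 1.0 / (1.0 - overhead_fraction)

tiny = max_speedup(0.81)    # _process_inputs share reported for the tiny model
medium = max_speedup(0.01)  # "less than 1%" for the mid-sized model, rounded up

print(f"tiny model:   at most {tiny:.2f}x faster")    # at most 5.26x faster
print(f"medium model: at most {medium:.2f}x faster")  # at most 1.01x faster
```

This is why the compiled vs uncompiled gap matters for tiny models and all but vanishes for mid-sized ones.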


DATA PROCESSORS:

(A): <class 'tensorflow.python.keras.engine.data_adapter.TensorLikeDataAdapter'>, used in _process_inputs(). Relevant source code

(B): numpy.ndarray, returned by convert_eager_tensors_to_numpy. Relevant source code, and here


MODEL EXECUTION FUNCTION (e.g. predict)

(A): distribution function, and here

(B): distribution function (different), and here


PROFILER: results for code in my other answer, "tiny model", and in this answer, "medium model":

[Profiler screenshot: tiny model, 1000 iterations, compile()]

[Profiler screenshot: tiny model, 1000 iterations, no compile()]

[Profiler screenshot: medium model, 10 iterations]


DOCUMENTATION (indirectly) on effects of compile(): source

Unlike other TensorFlow operations, we don't convert python numerical inputs to tensors. Moreover, a new graph is generated for each distinct python numerical value, for example calling g(2) and g(3) will generate two new graphs

function instantiates a separate graph for every unique set of input shapes and datatypes. For example, the following code snippet will result in three distinct graphs being traced, as each input has a different shape

A single tf.function object might need to map to multiple computation graphs under the hood. This should be visible only as performance (tracing graphs has a nonzero computational and memory cost) but should not affect the correctness of the program
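
The tracing rules quoted above can be observed directly. A small sketch, assuming TensorFlow 2.x; the traces list is an illustrative name, and g mirrors the g(2) / g(3) example from the quote. It relies on the fact that Python side effects inside a tf.function body run only while a graph is being traced.

```python
import tensorflow as tf

traces = []  # grows by one entry each time TensorFlow traces a new graph

@tf.function
def g(x):
    traces.append(1)  # Python side effect: executes during tracing only
    return x * x

g(2)
g(3)                    # distinct Python values: a second graph
after_python_values = len(traces)

g(tf.constant(2))
g(tf.constant(3))       # same shape and dtype: one graph, reused
after_scalar_tensors = len(traces)

g(tf.constant([1, 2]))  # new input shape: another graph
after_new_shape = len(traces)

print(after_python_values, after_scalar_tensors, after_new_shape)
```

Passing tensors of a fixed shape and dtype, instead of raw Python numbers, keeps the number of traced graphs down.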


COUNTEREXAMPLE:

from tensorflow.keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
import numpy as np
from time import time

def timeit(func, arg, iterations):
    t0 = time()
    for _ in range(iterations):
        func(arg)
    print("%.4f sec" % (time() - t0))

batch_size = 32
batch_shape = (batch_size, 400, 16)
ipt   = Input(batch_shape=batch_shape)
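x     = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
# NOTE: the next layer is also applied to `ipt`, so the Bidirectional output
# above is overwritten and never reaches the rest of the model.
x     = LSTM(512, activation='relu', return_sequences=True)(ipt)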
x     = Conv1D(128, 400, 1, padding='same')(x)
x     = Flatten()(x)
x     = Dense(256, activation='relu')(x)
x     = Dropout(0.5)(x)
x     = Dense(128, activation='relu')(x)
x     = Dense(64,  activation='relu')(x)
out   = Dense(1,  activation='sigmoid')(x)
model = Model(ipt, out)

X = np.random.randn(*batch_shape)
timeit(model.predict, X, 10)
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 10)

Outputs:

34.8542 sec
34.7435 sec
