Why is TensorFlow 2 much slower than TensorFlow 1?


Problem description


It's been cited by many users as the reason for switching to Pytorch, but I've yet to find a justification/explanation for sacrificing the most important practical quality, speed, for eager execution.

Below is code benchmarking performance, TF1 vs. TF2 - with TF1 running anywhere from 47% to 276% faster.

My question is: what is it, at the graph or hardware level, that yields such a significant slowdown?


Looking for a detailed answer - am already familiar with broad concepts.

Specs: CUDA 10.0.130, cuDNN 7.4.2, Python 3.7.4, Windows 10, GTX 1070


Benchmark results:


UPDATE: Disabling Eager Execution per below code does not help. The behavior, however, is inconsistent: sometimes running in graph mode helps considerably, other times it runs slower relative to Eager.
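For reference, a minimal sketch of the disabling call this refers to (assuming TF 2.x and the tf.compat.v1 API; the same call appears in the answer below):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # must run before any model/tensor creation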


Benchmark code:

# use tensorflow.keras... to benchmark tf.keras; used GPU for all above benchmarks
from keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from keras.layers import Flatten, Dropout
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
import numpy as np
from time import time

batch_shape = (32, 400, 16)
X, y = make_data(batch_shape)

model_small = make_small_model(batch_shape)
model_small.train_on_batch(X, y)  # skip first iteration which builds graph
timeit(model_small.train_on_batch, 200, X, y)

K.clear_session()  # in my testing, kernel was restarted instead

model_medium = make_medium_model(batch_shape)
model_medium.train_on_batch(X, y)  # skip first iteration which builds graph
timeit(model_medium.train_on_batch, 10, X, y)


Functions used:

def timeit(func, iterations, *args):
    t0 = time()
    for _ in range(iterations):
        func(*args)
    print("Time/iter: %.4f sec" % ((time() - t0) / iterations))

def make_small_model(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Conv1D(128, 400, strides=4, padding='same')(ipt)
    x     = Flatten()(x)
    x     = Dropout(0.5)(x)
    x     = Dense(64, activation='relu')(x)
    out   = Dense(1,  activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_medium_model(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
    x     = LSTM(512, activation='relu', return_sequences=True)(x)
    x     = Conv1D(128, 400, strides=4, padding='same')(x)
    x     = Flatten()(x)
    x     = Dense(256, activation='relu')(x)
    x     = Dropout(0.5)(x)
    x     = Dense(128, activation='relu')(x)
    x     = Dense(64,  activation='relu')(x)
    out   = Dense(1,   activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model
    
def make_data(batch_shape):
    return np.random.randn(*batch_shape), np.random.randint(0, 2, (batch_shape[0], 1))
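
Per the comment at the top of the benchmark code, the tf.keras runs only swap the imports; a sketch (assuming TF's bundled Keras mirrors the standalone keras API used above):

from tensorflow.keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K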

Solution

UPDATE 8/17-8/30/2020: TF 2.3 has finally done it: all cases run as fast, or notably faster, than any previous version.

Further, my previous update was unfair to TF; my GPU was to blame - it has been overheating lately. If you see a rising stem plot of iteration times, it's a reliable symptom. Lastly, see a dev's note on Eager vs Graph.

This might be my last update on this answer. The true stats on your model's speed can only be found by you, on your device.


UPDATE 5/19/2020: TF 2.2, using same tests: only a minor improvement in Eager speed. Plots for Large-Large Numpy train_on_batch case below, x-axis is successive fit iterations; my GPU isn't near its full capacity, so doubt it's throttling, but iterations do get slower over time.

Per above, Graph and Eager are 1.56x and 1.97x slower than their TF1 counterparts, respectively. Unsure I'll debug this further, as I'm considering switching to Pytorch per TensorFlow's poor support for custom / low-level functionality. I did, however, open an Issue to get devs' feedback.


UPDATE 2/18/2020: I've benched 2.1 and 2.1-nightly; the results are mixed. All but one config (model & data size) run as fast as or much faster than the best of TF2 & TF1. The one that's slower, and dramatically so, is Large-Large - esp. in Graph execution (1.6x to 2.5x slower).

Furthermore, there are extreme reproducibility differences between Graph and Eager for a large model I tested - one not explainable via randomness/compute-parallelism. I can't currently present reproducible code for these claims per time constraints, so instead I strongly recommend testing this for your own models.

Haven't opened a Git issue on these yet, but I did comment on the original - no response yet. I'll update the answer(s) once progress is made.


VERDICT: it isn't, IF you know what you're doing. But if you don't, it could cost you, lots - by a few GPU upgrades on average, and by multiple GPUs worst-case.


THIS ANSWER: aims to provide a high-level description of the issue, as well as guidelines for how to decide on the training configuration specific to your needs. For a detailed, low-level description, which includes all benchmarking results + code used, see my other answer.

I'll be updating my answer(s) w/ more info if I learn any - you can bookmark / "star" this question for reference.


ISSUE SUMMARY: as confirmed by a TensorFlow developer, Q. Scott Zhu, TF2 focused development on Eager execution & tight integration w/ Keras, which involved sweeping changes in TF source - including at graph-level. Benefits: greatly expanded processing, distribution, debug, and deployment capabilities. The cost of some of these, however, is speed.

The matter, however, is considerably more complex. It isn't just TF1 vs. TF2 - factors yielding significant differences in train speed include:

  1. TF2 vs. TF1
  2. Eager vs. Graph mode
  3. keras vs. tf.keras
  4. numpy vs. tf.data.Dataset vs. ...
  5. train_on_batch() vs. fit()
  6. GPU vs. CPU
  7. model(x) vs. model.predict(x) vs. ...

Unfortunately, almost none of the above are independent of the other, and each can at least double execution time relative to another. Fortunately, you can determine what'll work best systematically, and with a few shortcuts - as I'll be showing.
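
As a concrete starting point for that experimentation, a minimal timing sketch (the helper name and the usage line are illustrative assumptions; model_small, X, y refer to the question's benchmark code):

from time import time

def time_train_step(model, X, y, iterations=100, use_fit=False):
    # Time either train_on_batch or single-batch fit calls, skipping the
    # first call, which builds the graph / traces functions.
    if use_fit:
        step = lambda: model.fit(X, y, batch_size=len(X), epochs=1, verbose=0)
    else:
        step = lambda: model.train_on_batch(X, y)
    step()  # warm-up
    t0 = time()
    for _ in range(iterations):
        step()
    return (time() - t0) / iterations

# Re-run the same call under each configuration (TF1 vs TF2, Eager vs Graph,
# keras vs tf.keras) and compare:
# print(time_train_step(model_small, X, y, iterations=200))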


WHAT SHOULD I DO? Currently, the only way is - experiment for your specific model, data, and hardware. No single configuration will always work best - but there are do's and don't's to simplify your search:

>> DO:

  • train_on_batch() + numpy + tf.keras + TF1 + Eager/Graph
  • train_on_batch() + numpy + tf.keras + TF2 + Graph
  • fit() + numpy + tf.keras + TF1/TF2 + Graph + large model & data

>> DON'T:

  • fit() + numpy + keras for small & medium models and data

  • fit() + numpy + tf.keras + TF1/TF2 + Eager

  • train_on_batch() + numpy + keras + TF1 + Eager

  • [Major] tf.python.keras; it can run 10-100x slower, and w/ plenty of bugs; more info

    • This includes layers, models, optimizers, & related "out-of-box" usage imports; ops, utils, & related 'private' imports are fine - but to be sure, check for alts, & whether they're used in tf.keras

Refer to code at bottom of my other answer for an example benchmarking setup. The list above is based mainly on the "BENCHMARKS" tables in the other answer.
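
To make the tf.python.keras point above concrete, a hedged sketch of the import distinction (the module paths are real; how much slower a given symbol runs is version-dependent and worth verifying yourself):

# Preferred: the public tf.keras API
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

# Avoid for "out-of-box" usage: the private implementation path,
# which per the list above can be far slower and buggier
# from tensorflow.python.keras.layers import Dense
# from tensorflow.python.keras.models import Model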


LIMITATIONS of the above DO's & DON'T's:

  • This question's titled "Why is TF2 much slower than TF1?", and while its body concerns training explicitly, the matter isn't limited to it; inference, too, is subject to major speed differences, even within the same TF version, import, data format, etc. - see this answer; a small inference timing sketch also follows this list.
  • RNNs are likely to notably change the data grid in the other answer, as they've been improved in TF2
  • Models primarily used Conv1D and Dense - no RNNs, sparse data/targets, 4/5D inputs, & other configs
  • Input data limited to numpy and tf.data.Dataset, while many other formats exist; see other answer
  • GPU was used; results will differ on a CPU. In fact, when I asked the question, my CUDA wasn't properly configured, and some of the results were CPU-based.
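
The inference timing sketch referenced above - a hedged illustration, assuming TF2 with a built Keras model (e.g. the question's model_small); model(x) and model.predict(x) take different code paths, and their per-call overhead can differ widely:

from time import time
import numpy as np

def time_inference(model, x, iterations=100, use_predict=False):
    # Time model(x) vs. model.predict(x), skipping the first call,
    # which traces / builds the compute graph.
    call = (lambda: model.predict(x, verbose=0)) if use_predict else \
           (lambda: model(x, training=False))
    call()  # warm-up
    t0 = time()
    for _ in range(iterations):
        call()
    return (time() - t0) / iterations

# x = np.random.randn(32, 400, 16).astype('float32')  # the question's batch_shape
# print(time_inference(model_small, x), time_inference(model_small, x, use_predict=True))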

Why did TF2 sacrifice the most practical quality, speed, for eager execution? It hasn't, clearly - graph is still available. But if the question is "why eager at all":

  • Superior debugging: you've likely come across multitudes of questions asking "how do I get intermediate layer outputs" or "how do I inspect weights"; with eager, it's (almost) as simple as .__dict__; a short sketch follows this list. Graph, in contrast, requires familiarity with special backend functions - greatly complicating the entire process of debugging & introspection.
  • Faster prototyping: per ideas similar to above; faster understanding = more time left for actual DL.
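
A minimal sketch of the eager-mode introspection mentioned above, assuming TF2 eager and the question's model_small (the helper name is an illustrative assumption; .kernel and .numpy() are standard tf.keras/TF2 attributes):

import tensorflow as tf

def inspect_intermediate(model, x, upto_layer=3):
    # Run the first few layers eagerly, one by one, and return the activation
    out = x
    for layer in model.layers[1:upto_layer + 1]:  # layers[0] is the Input layer
        out = layer(out)
    return out

# In eager mode this is plain Python - no sessions or fetch ops needed:
# act = inspect_intermediate(model_small, x)           # x: a float32 batch
# print(act.shape, float(tf.reduce_mean(act)))
# print(model_small.layers[-1].kernel.numpy()[:3])     # peek at weights directly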

HOW TO ENABLE/DISABLE EAGER?

tf.enable_eager_execution()  # TF1; must be done before any model/tensor creation
tf.compat.v1.disable_eager_execution() # TF2; above holds

Misleading in TF2; see here.


ADDITIONAL INFO:

  • Careful with _on_batch() methods in TF2; according to the TF dev, they still use a slower implementation, but not intentionally - i.e. it's to be fixed. See other answer for details.

REQUESTS TO TENSORFLOW DEVS:

  1. Please fix train_on_batch(), and the performance aspect of calling fit() iteratively; custom train loops are important to many, especially to me.
  2. Add documentation / docstring mention of these performance differences for users' knowledge.
  3. Improve general execution speed to keep peeps from hopping to Pytorch.


ACKNOWLEDGEMENTS: Thanks to


UPDATES:

  • 11/14/19 - found a model (in my real application) that runs slower on TF2 for all* configurations w/ Numpy input data. Differences ranged 13-19%, averaging 17%. Differences between keras and tf.keras, however, were more dramatic: 18-40%, avg. 32% (both TF1 & 2). (* - except Eager, for which TF2 OOM'd)

  • 11/17/19 - devs updated on_batch() methods in a recent commit, stating that they improve speed - to be released in TF 2.1, or available now as tf-nightly. As I'm unable to get the latter running, I'll delay benching until 2.1.

  • 2/20/20 - prediction performance is also worth benching; in TF2, for example, CPU prediction times can involve periodic spikes
