Why is TensorFlow 2 much slower than TensorFlow 1?


Problem description

It's been cited by many users as the reason for switching to Pytorch, but I've yet to find a justification / explanation for sacrificing the most important practical quality, speed, for eager execution.

Below is code benchmarking performance, TF1 vs. TF2 - with TF1 running anywhere from 47% to 276% faster.

My question is: what, at the graph or hardware level, yields such a significant slowdown?

Looking for a detailed answer - am already familiar with broad concepts. Relevant Git

Specs: CUDA 10.0.130, cuDNN 7.4.2, Python 3.7.4, Windows 10, GTX 1070

Benchmarking results:

UPDATE: Disabling Eager Execution per below code does not help. The behavior, however, is inconsistent: sometimes running in graph mode helps considerably, other times it runs slower relative to Eager.

As TF devs haven't appeared anywhere, I'll be investigating this matter myself - progress can be followed in the linked Github issue.

UPDATE 2: tons of experimental results to share, along with explanations; should be done today.

Benchmark code:

# use tensorflow.keras... to benchmark tf.keras; used GPU for all above benchmarks
from keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from keras.layers import Flatten, Dropout
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
import numpy as np
from time import time

batch_shape = (32, 400, 16)
X, y = make_data(batch_shape)

model_small = make_small_model(batch_shape)
model_small.train_on_batch(X, y)  # skip first iteration which builds graph
timeit(model_small.train_on_batch, 200, X, y)

K.clear_session()  # in my testing, kernel was restarted instead

model_medium = make_medium_model(batch_shape)
model_medium.train_on_batch(X, y)  # skip first iteration which builds graph
timeit(model_medium.train_on_batch, 10, X, y)


Functions used:

def timeit(func, iterations, *args):
    t0 = time()
    for _ in range(iterations):
        func(*args)
    print("Time/iter: %.4f sec" % ((time() - t0) / iterations))

def make_small_model(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Conv1D(128, 400, strides=4, padding='same')(ipt)
    x     = Flatten()(x)
    x     = Dropout(0.5)(x)
    x     = Dense(64, activation='relu')(x)
    out   = Dense(1,  activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_medium_model(batch_shape):
    ipt   = Input(batch_shape=batch_shape)
    x     = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
    x     = LSTM(512, activation='relu', return_sequences=True)(x)
    x     = Conv1D(128, 400, strides=4, padding='same')(x)
    x     = Flatten()(x)
    x     = Dense(256, activation='relu')(x)
    x     = Dropout(0.5)(x)
    x     = Dense(128, activation='relu')(x)
    x     = Dense(64,  activation='relu')(x)
    out   = Dense(1,   activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_data(batch_shape):
    return np.random.randn(*batch_shape), np.random.randint(0, 2, (batch_shape[0], 1))
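
The comment at the top of the benchmark code mentions using tensorflow.keras to benchmark tf.keras. As a minimal sketch of that variant (my addition, assuming a TF version where tf.keras mirrors the standalone Keras API, e.g. 1.14+ or 2.x), only the imports change; the rest of the benchmark code stays identical:

# tf.keras variant of the imports above; everything else in the benchmark is unchanged
from tensorflow.keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K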

Accepted answer

UPDATE 2/18/2020: I've benched 2.1 and 2.1-nightly; the results are mixed. All configurations but one (model & data size) are as fast as or much faster than the best of TF1 & TF2. The one that's slower, dramatically so, is Large-Large - especially in Graph execution (1.6x to 2.5x slower).

Furthermore, there are extreme reproducibility differences between Graph and Eager for a large model I tested - one not explainable via randomness/compute-parallelism. I can't currently present reproducible code for these claims per time constraints, so instead I strongly recommend testing this for your own models.

Haven't opened a Git issue on these yet, but I did comment on the original - no response yet. I'll update the answer(s) once progress is made.

VERDICT: it isn't, IF you know what you're doing. But if you don't, it could cost you, lots - by a few GPU upgrades on average, and by multiple GPUs worst-case.

THIS ANSWER: aims to provide a high-level description of the issue, as well as guidelines for how to decide on the training configuration specific to your needs. For a detailed, low-level description, which includes all benchmarking results + code used, see my other answer.

I'll be updating my answer(s) w/ more info if I learn any - can bookmark / "star" this question for reference.

ISSUE SUMMARY: as confirmed by a TensorFlow developer, Q. Scott Zhu, TF2 focused development on Eager execution & tight integration w/ Keras, which involved sweeping changes in TF source - including at graph-level. Benefits: greatly expanded processing, distribution, debug, and deployment capabilities. The cost of some of these, however, is speed.

The matter, however, is considerably more complex. It isn't just TF1 vs. TF2 - factors yielding significant differences in train speed include:

  1. TF2 vs. TF1
  2. Eager vs. Graph mode
  3. keras vs. tf.keras
  4. numpy vs. tf.data.Dataset vs. ...
  5. train_on_batch() vs. fit()
  6. GPU vs. CPU
  7. model(x) vs. model.predict(x) vs. ...

Unfortunately, almost none of the above are independent of the other, and each can at least double execution time relative to another. Fortunately, you can determine what'll work best systematically, and with a few shortcuts - as I'll be showing.
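
As a concrete illustration of probing a few of these factors, below is a minimal sketch (my own addition, not from the original post) that times train_on_batch() vs. fit(), and numpy vs. tf.data.Dataset feeding, on the same small model. It reuses make_small_model, make_data, and timeit from the question and assumes TF2 with tf.keras imports:

import tensorflow as tf

batch_shape = (32, 400, 16)
X, y = make_data(batch_shape)
model = make_small_model(batch_shape)
model.train_on_batch(X, y)  # warm-up: builds the graph / traces functions

# 1) train_on_batch() fed with numpy arrays
timeit(model.train_on_batch, 100, X, y)

# 2) fit() called iteratively on the same numpy batch
def fit_numpy(X, y):
    model.fit(X, y, batch_size=batch_shape[0], epochs=1, verbose=0)
fit_numpy(X, y)  # warm-up
timeit(fit_numpy, 100, X, y)

# 3) fit() fed from a tf.data.Dataset built from the same arrays
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(batch_shape[0]).repeat()
def fit_dataset():
    model.fit(dataset, steps_per_epoch=1, epochs=1, verbose=0)
fit_dataset()  # warm-up
timeit(fit_dataset, 100)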

WHAT SHOULD I DO? Currently, the only way is - experiment for your specific model, data, and hardware. No single configuration will always work best - but there are do's and don't's to simplify your search:

>> DO:

  • train_on_batch() + numpy + tf.keras + TF1 + Eager/Graph
  • train_on_batch() + numpy + tf.keras + TF2 + Graph
  • fit() + numpy + tf.keras + TF1/TF2 + Graph + large model & data

>> DON'T:

  • fit() + numpy + keras for small & medium models and data
  • fit() + numpy + tf.keras + TF1/TF2 + Eager
  • train_on_batch() + numpy + keras + TF1 + Eager

[Major] tf.python.keras; it can run 10-100x slower, and w/ plenty of bugs; more info

  • This includes layers, models, optimizers, & related "out-of-box" usage imports; ops, utils, & related 'private' imports are fine - but to be sure, check for alts, & whether they're used in tf.keras

Refer to code at bottom of my other answer for an example benchmarking setup. The list above is based mainly on the "BENCHMARKS" tables in the other answer.

LIMITATIONS:

  • This question's titled "Why is TF2 much slower than TF1?", and while its body concerns training explicitly, the matter isn't limited to it; inference, too, is subject to major speed differences, even within the same TF version, import, data format, etc. - see this answer, and the short timing sketch after this list.
  • RNNs are likely to notably change the data grid in the other answer, as they've been improved in TF2
  • Models primarily used Conv1D and Dense - no RNNs, sparse data/targets, 4/5D inputs, & other configs
  • Input data limited to numpy and tf.data.Dataset, while many other formats exist; see other answer
  • GPU was used; results will differ on a CPU. In fact, when I asked the question, my CUDA wasn't properly configured, and some of the results were CPU-based.
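
For the inference point above, here is a small sketch (mine; assumes TF2 / tf.keras and reuses the question's make_data, make_small_model, and timeit helpers) of the model(x) vs. model.predict(x) comparison being referred to:

import tensorflow as tf

batch_shape = (32, 400, 16)
X, _ = make_data(batch_shape)
model = make_small_model(batch_shape)

model.predict(X)               # warm-up
timeit(model.predict, 100, X)  # predict(): includes batching & numpy conversion overhead

x_t = tf.convert_to_tensor(X, dtype=tf.float32)
_ = model(x_t, training=False)  # warm-up
def call_model():
    return model(x_t, training=False)  # direct __call__: returns a tensor, skips predict()'s extra logic
timeit(call_model, 100)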

Why did TF2 sacrifice the most practical quality, speed, for eager execution? It hasn't, clearly - graph is still available. But if the question is "why eager at all":

  • Superior debugging: you've likely come across multitudes of questions asking "how do I get intermediate layer outputs" or "how do I inspect weights"; with eager, it's (almost) as simple as .__dict__. Graph, in contrast, requires familiarity with special backend functions - greatly complicating the entire process of debugging & introspection; see the small sketch after this list.
  • Faster prototyping: per ideas similar to above; faster understanding = more time left for actual DL.
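
As a sketch of the first point (my addition; assumes TF2 / tf.keras with Eager enabled by default), inspecting weights and intermediate outputs without sessions or backend functions:

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

ipt    = Input((16,))
hidden = Dense(8, activation='relu')(ipt)
out    = Dense(1, activation='sigmoid')(hidden)
model  = Model(ipt, out)

x = tf.random.normal((4, 16))

# Weights are plain tensors on the layer object - no K.get_value() / session needed
print(model.layers[1].kernel.numpy().shape)   # (16, 8)

# Intermediate layer outputs via a sub-model, evaluated eagerly
feature_model = Model(ipt, hidden)
print(feature_model(x).numpy().shape)         # (4, 8)

# Layer internals are ordinary Python attributes (cf. .__dict__ above)
print(sorted(model.layers[1].__dict__)[:5])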

HOW TO ENABLE / DISABLE EAGER?

tf.enable_eager_execution()  # TF1; must be done before any model/tensor creation
tf.compat.v1.disable_eager_execution() # TF2; above holds
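
A quick way to confirm which mode is actually active (my addition; tf.executing_eagerly() exists in both TF1 and TF2):

import tensorflow as tf
print(tf.executing_eagerly())  # True in TF2 by default; False after disable_eager_execution()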


ADDITIONAL INFO:

  • Careful with _on_batch() methods in TF2; according to the TF dev, they still use a slower implementation, but not intentionally - i.e. it's to be fixed. See other answer for details.

REQUESTS FOR TENSORFLOW DEVS:

  1. Please fix train_on_batch(), and the performance aspect of calling fit() iteratively; custom train loops are important to many, especially to me.
  2. Add documentation / docstring mention of these performance differences for users' knowledge.
  3. Improve general execution speed to keep peeps from hopping to Pytorch.


ACKNOWLEDGEMENTS: Thanks to

  • Q. Scott Zhu, TensorFlow developer, for his detailed clarification on the matter.
  • P. Andrey for sharing useful testing and discussion.

UPDATES:

  • 11/14/19 - found a model (in my real application) that runs slower on TF2 for all* configurations w/ Numpy input data. Differences ranged 13-19%, averaging 17%. Differences between keras and tf.keras, however, were more dramatic: 18-40%, avg. 32% (both TF1 & 2). (* - except Eager, for which TF2 OOM'd)

  • 11/17/19 - devs updated on_batch() methods in a recent commit, stating they have improved speed - to be released in TF 2.1, or available now as tf-nightly. As I'm unable to get the latter running, I'll delay benching until 2.1.
