How to compile Tensorflow with SSE4.2 and AVX instructions?


Question

This is the message received from running a script to check if Tensorflow is working:

I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I noticed that it mentions SSE4.2 and AVX:

  1. What are SSE4.2 and AVX?
  2. How do SSE4.2 and AVX improve CPU computations for Tensorflow tasks?
  3. How do I compile Tensorflow with these two instruction sets?

Accepted answer

I just ran into this same problem. It seems Yaroslav Bulatov's suggestion doesn't cover SSE4.2 support; adding --copt=-msse4.2 was enough. In the end, I successfully built with

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package

without getting any warnings or errors.

Probably the best choice for any system is:

bazel build -c opt --copt=-march=native --copt=-mfpmath=both --config=cuda -k //tensorflow/tools/pip_package:build_pip_package

(Update: the build scripts may be eating -march=native, possibly because it contains an =.)

-mfpmath=both only works with gcc, not clang. -mfpmath=sse is probably just as good, if not better, and is the default for x86-64. 32-bit builds default to -mfpmath=387, so changing that will help for 32-bit. (But if you want high-performance number crunching, you should build 64-bit binaries.)

I'm not sure whether TensorFlow's default is -O2 or -O3. gcc -O3 enables full optimization including auto-vectorization, but that can sometimes make code slower.

What this does: --copt for bazel build passes an option directly to gcc for compiling C and C++ files (but not for linking, so you'd need a different option for cross-file link-time optimization).

x86-64 gcc defaults to using only SSE2 or older SIMD instructions, so you can run the binaries on any x86-64 system. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). That's not what you want. You want to make a binary that takes advantage of all the instructions your CPU can run, because you're only running this binary on the system where you built it.

-march=native enables all the options your CPU supports, so it makes -mavx512f -mavx2 -mavx -mfma -msse4.2 redundant. (Also, -mavx2 already enables -mavx and -msse4.2, so Yaroslav's command should have been fine). Also, if you're using a CPU that doesn't support one of these options (like FMA), using -mfma would make a binary that faults with illegal instructions.

TensorFlow's ./configure defaults to enabling -march=native, so using that should avoid needing to specify compiler options manually.

-march=native enables -mtune=native, so it optimizes for your CPU, e.g. for which sequence of AVX instructions is best for unaligned loads.

This all applies to gcc, clang, or ICC. (For ICC, you can use -xHOST instead of -march=native.)
