How to compile Tensorflow with SSE4.2 and AVX instructions?
Problem description
This is the message received from running a script to check if Tensorflow is working:
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I noticed that it mentions SSE4.2 and AVX:
- What are SSE4.2 and AVX?
- How do SSE4.2 and AVX improve CPU computations for Tensorflow tasks?
- How do I compile Tensorflow so that it uses these two instruction sets?
I just ran into this same problem. It seems that Yaroslav Bulatov's suggestion doesn't cover SSE4.2 support; adding --copt=-msse4.2 would suffice. In the end, I successfully built with
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package
without getting any warnings or errors.
Probably the best choice for any system is:
bazel build -c opt --copt=-march=native --copt=-mfpmath=both --config=cuda -k //tensorflow/tools/pip_package:build_pip_package
(Update: the build scripts may be eating -march=native, possibly because it contains an =.)
-mfpmath=both only works with gcc, not clang. -mfpmath=sse is probably just as good, if not better, and is the default for x86-64. 32-bit builds default to -mfpmath=387, so changing that will help for 32-bit. (But if you want high performance for number crunching, you should build 64-bit binaries.)
I'm not sure whether TensorFlow defaults to -O2 or -O3. gcc -O3 enables full optimization including auto-vectorization, but that can sometimes make code slower.
What this does: --copt for bazel build passes an option directly to gcc for compiling C and C++ files (but not for linking, so you need a different option for cross-file link-time optimization).
x86-64 gcc defaults to using only SSE2 or older SIMD instructions, so you can run the binaries on any x86-64 system. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). That's not what you want. You want to make a binary that takes advantage of all the instructions your CPU can run, because you're only running this binary on the system where you built it.
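To see which of these extensions your CPU can actually run before choosing flags by hand, one option is to read the CPU's feature list. The following is a minimal sketch, assuming a Linux system where /proc/cpuinfo has a "flags" line; the function name simd_copts is made up for illustration and is not part of TensorFlow or bazel.

```shell
# Print a --copt flag for each SIMD extension found in a cpuinfo-style file.
# Assumption: Linux layout, where a line starting with "flags" lists CPU features.
simd_copts() {
  cpuinfo="${1:-/proc/cpuinfo}"
  # Take the first "flags" line and pad it with spaces so that
  # " avx " matches only the whole word and not "avx2".
  flags=" $(grep -m1 '^flags' "$cpuinfo" | cut -d: -f2) "
  out=""
  for pair in sse4_2:-msse4.2 avx:-mavx avx2:-mavx2 fma:-mfma; do
    name="${pair%%:*}"   # feature name as the kernel spells it
    opt="${pair#*:}"     # matching gcc option
    case "$flags" in
      *" $name "*) out="$out --copt=$opt" ;;
    esac
  done
  # Drop the leading space before printing.
  echo "${out# }"
}

# Possible use: bazel build -c opt $(simd_copts) //tensorflow/tools/pip_package:build_pip_package
```

This only automates the manual flag list; it covers four extensions and nothing else.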
-march=native enables all the options your CPU supports, so it makes -mavx512f -mavx2 -mavx -mfma -msse4.2 redundant. (Also, -mavx2 already enables -mavx and -msse4.2, so Yaroslav's command should have been fine.) Also, if you're using a CPU that doesn't support one of these options (like FMA), using -mfma would make a binary that faults with illegal instructions.
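If you want to check what -march=native turns on for your machine, gcc can report its target options with -Q --help=target. The helper below is a rough sketch that filters that report down to the four options discussed here; the function name enabled_isa_flags is hypothetical, and it assumes the report prints one option per line followed by [enabled] or [disabled], which may differ between gcc versions.

```shell
# Read `gcc -Q --help=target` output on stdin and print the SIMD options
# that are marked [enabled]. Assumes lines of the form: "  -mavx   [enabled]".
enabled_isa_flags() {
  awk '$2 == "[enabled]" && $1 ~ /^-m(sse4\.2|avx|avx2|fma)$/ { print $1 }'
}

# Possible use (assumes gcc targeting x86):
#   gcc -march=native -Q --help=target | enabled_isa_flags
```

Comparing the result with and without -march=native shows which options it adds beyond the SSE2 baseline.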
TensorFlow's ./configure defaults to enabling -march=native, so using that should avoid needing to specify compiler options manually.
-march=native enables -mtune=native, so it optimizes for your CPU for things like which sequence of AVX instructions is best for unaligned loads.
This all applies to gcc, clang, or ICC. (For ICC, you can use -xHOST instead of -march=native.)