How to compile Tensorflow with SSE4.2 and AVX instructions?


Question



This is the message received from running a script to check if Tensorflow is working:

I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I noticed that it mentions SSE4.2 and AVX:

  1. What are SSE4.2 and AVX?
  2. How do SSE4.2 and AVX improve CPU computations for Tensorflow tasks?
  3. How can Tensorflow be compiled to use these two instruction sets?
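Before recompiling, it can help to confirm which of these extensions the CPU actually advertises. A minimal check, assuming Linux (the kernel lists the flags in /proc/cpuinfo, spelled with underscores, e.g. sse4_2 rather than SSE4.2):

```shell
# List the relevant SIMD extension flags this CPU reports (Linux only).
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -xE 'sse4_2|avx|avx2|fma' | sort -u
```

On a machine that triggers the warnings above, this should print at least sse4_2 and avx.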

Solution

I just ran into this same problem. It seems that Yaroslav Bulatov's suggestion doesn't cover SSE4.2 support; adding --copt=-msse4.2 fixes that. In the end, I successfully built with

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package

without getting any warnings or errors.

Probably the best choice for any system is:

bazel build -c opt --copt=-march=native --copt=-mfpmath=both --config=cuda -k //tensorflow/tools/pip_package:build_pip_package

(Update: the build scripts may be eating -march=native, possibly because it contains an =.)

-mfpmath=both only works with gcc, not clang. -mfpmath=sse is probably just as good, if not better, and is the default for x86-64. 32-bit builds default to -mfpmath=387, so changing that will help for 32-bit. (But if you want high-performance number crunching, you should build 64-bit binaries.)

I'm not sure whether TensorFlow's default is -O2 or -O3. gcc -O3 enables full optimization, including auto-vectorization, but that can sometimes make code slower.


What this does: --copt for bazel build passes an option directly to gcc for compiling C and C++ files (but not for linking, so you need a different option for cross-file link-time optimization).

x86-64 gcc defaults to using only SSE2 or older SIMD instructions, so you can run the binaries on any x86-64 system. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). That's not what you want. You want to make a binary that takes advantage of all the instructions your CPU can run, because you're only running this binary on the system where you built it.
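You can inspect that default directly by dumping gcc's predefined macros (assuming an x86-64 gcc): only the baseline SSE2 macro is set.

```shell
# With no -m flags, x86-64 gcc predefines __SSE2__ but not __SSE4_2__ or __AVX__.
gcc -dM -E - </dev/null | grep -E '__SSE2__|__SSE4_2__|__AVX__'
# prints: #define __SSE2__ 1
```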

-march=native enables all the options your CPU supports, so it makes -mavx512f -mavx2 -mavx -mfma -msse4.2 redundant. (Also, -mavx2 already enables -mavx and -msse4.2, so Yaroslav's command should have been fine.) And if you're using a CPU that doesn't support one of these options (like FMA), using -mfma would make a binary that faults with an illegal instruction.

TensorFlow's ./configure defaults to enabling -march=native, so using that should avoid needing to specify compiler options manually.

-march=native enables -mtune=native, so it optimizes for your CPU for things like which sequence of AVX instructions is best for unaligned loads.

This all applies to gcc, clang, or ICC. (For ICC, you can use -xHOST instead of -march=native.)
