fmad=false gives good performance

Problem description

From Nvidia release notes:

 The nvcc compiler switch, --fmad (short name: -fmad), to control the contraction of    
 floating-point multiplies and add/subtracts into floating-point multiply-add   
 operations (FMAD, FFMA, or DFMA) has been added: 
 --fmad=true and --fmad=false enables and disables the contraction respectively. 
 This switch is supported only when the --gpu-architecture option is set with     
 compute_20, sm_20, or higher. For other architecture classes, the contraction is     
  always enabled. 
 The --use_fast_math option implies --fmad=true, and enables the contraction.

I have two kernels: one is purely compute-bound with lots of multiplications, whereas the other is memory-bound. I see a consistent performance improvement (around 5%) for my compute-intensive kernel when I compile with -fmad=false, and a decline of about the same percentage when I turn FMA off for my memory-bound kernel. So FMA works better for my memory-bound kernel, but my compute-bound kernel can squeeze out a little performance with it turned off. What could be the reason? My device is an M2090 and I am using CUDA 4.2.

Full compilation options: -arch,sm_20,-ftz=true,-prec-div=false,-prec-sqrt=false,-use_fast_math,-fmad=false (or I just remove -fmad=false, since contraction is enabled by default anyway).

Recommended answer

Use of FMA may increase register pressure slightly, because three source operands must be available at the same time. Turning FMA generation on or off can therefore lead to small differences in instruction scheduling and register allocation, which in turn can lead to small performance differences. For a compute-bound kernel with many multiply-add idioms, -fmad=true should make a significant performance difference, but as you say, your kernel is dominated by multiplies and thus will benefit little from use of FMA, and any gains may be offset by the register pressure and instruction scheduling effects.
