CUDA compiler produces suboptimal assembler

Views: 291 | Posted: 2017/3/4 12:30:43 | Tags: optimization, assembly, cuda

Question

I compiled the following simple test kernel (CUDA 5, sm 2.0):

    __device__ void TestKernel(int *pdata)
    {
        int a0, b0, c0;

        a0 = pdata[0];
        b0 = pdata[1];

        c0 = a0 + b0;
        pdata[2] = c0;
    }

and expected assembler something like the following:

    LD   R3, [R0]
    LD   R4, [R0+4]
    IADD R4, R4, R3
    ST   [R0+8], R4

but cuobjdump --dump-sass shows a much longer result:

    /*0000*/     /*0x10001de428000000*/     MOV R0, R4;
    /*0008*/     /*0x00001de428000000*/     MOV R0, R0;
    /*0010*/     /*0x00001de428000000*/     MOV R0, R0;
    /*0018*/     /*0x00001de428000000*/     MOV R0, R0;
    /*0020*/     /*0x0000dc8580000000*/     LD R3, [R0];
    /*0028*/     /*0x0c00dde428000000*/     MOV R3, R3;
    /*0030*/     /*0x10011c034800c000*/     IADD R4, R0, 0x4;
    /*0038*/     /*0x10011de428000000*/     MOV R4, R4;
    /*0040*/     /*0x00411c8580000000*/     LD R4, [R4];
    /*0048*/     /*0x10011de428000000*/     MOV R4, R4;
    /*0050*/     /*0x1030dc0348000000*/     IADD R3, R3, R4;
    /*0058*/     /*0x20001c034800c000*/     IADD R0, R0, 0x8;
    /*0060*/     /*0x00001de428000000*/     MOV R0, R0;
    /*0068*/     /*0x0000dc8590000000*/     ST [R0], R3;
    /*0070*/     /*0x00001de790000000*/     RET;
    /*0078*/     /*0x00001de780000000*/     EXIT;
    /*0080*/     /*0x00001de780000000*/     EXIT;

The MOV instructions at addresses 8, 10, 18, 28, 38 and 60 look very strange to me, and the load/store instructions do not use an immediate offset. So instead of the expected 4 instructions (actually 6, including RET and EXIT), I get 15. What is a possible reason?
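
As an aside that is not part of the original post, the instruction count and the number of register-to-itself moves can be tallied mechanically instead of by eye. The sketch below assumes the listing format shown above, where every instruction line starts with an address comment such as /*0020*/. The function name count_sass is hypothetical.

```shell
# count_sass (hypothetical helper): summarise a cuobjdump SASS listing read
# from stdin. Assumes each instruction line starts with an address comment
# such as /*0020*/, as in the listing above.
count_sass() {
    awk '
        # Only lines that begin with /*HEX*/ are treated as instructions.
        /^[ \t]*\/\*[0-9a-fA-F]+\*\// {
            total++
            line = $0
            gsub(/\/\*[0-9a-fA-Fx]+\*\//, "", line)  # drop address and encoding comments
            gsub(/[ \t;]/, "", line)                   # drop blanks and the trailing semicolon
            # line now looks like MOVR0,R4 for a plain register move
            if (line ~ /^MOVR[0-9]+,R[0-9]+$/) {
                sub(/^MOV/, "", line)
                split(line, reg, ",")
                if (reg[1] == reg[2]) selfmov++        # MOV Rn, Rn copies a register to itself
            }
        }
        END { printf "instructions=%d self_moves=%d\n", total, selfmov }
    '
}
```

Typical use would be cuobjdump --dump-sass asmprob.o | count_sass. Padding instructions such as the trailing EXITs are counted as well, so the total is an upper bound on the useful work.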

Solution

What you are seeing is almost certainly because you are compiling with debugging turned on. If I build your kernel, I get this:

    $ nvcc -arch=sm_30 -c asmprob.cu
    $ cuobjdump -sass asmprob.o

    Fatbin elf code:
    ================
    arch = sm_30
    code version = [1,6]
    producer = cuda
    host = mac
    compile_size = 32bit
    identifier = asmprob.cu

        code for sm_30
            Function : _Z10TestKernelPi
        /*0008*/     /*0x10005de428004001*/     MOV R1, c [0x0] [0x44];
        /*0010*/     /*0x00009de428004005*/     MOV R2, c [0x0] [0x140];
        /*0018*/     /*0x10211c034800c000*/     IADD R4, R2, 0x4;
        /*0020*/     /*0x20209c034800c000*/     IADD R2, R2, 0x8;
        /*0028*/     /*0x0040dc8580000000*/     LD R3, [R4];
        /*0030*/     /*0xf0401c8583ffffff*/     LD R0, [R4+-0x4];
        /*0038*/     /*0x00301c0348000000*/     IADD R0, R3, R0;
        /*0048*/     /*0x00201c8590000000*/     ST [R2], R0;
        /*0050*/     /*0x00001de780000000*/     EXIT;
        /*0058*/     /*0xe0001de74003ffff*/     BRA 0x58;
        /*0060*/     /*0x00001de440000000*/     NOP CC.T;
        /*0068*/     /*0x00001de440000000*/     NOP CC.T;
        /*0070*/     /*0x00001de440000000*/     NOP CC.T;
        /*0078*/     /*0x00001de440000000*/     NOP CC.T;
        .................................

On the other hand, if I build it with debug settings, I get code just like you show:

    $ nvcc -arch=sm_30 -G -c asmprob.cu
    $ cuobjdump -sass asmprob.o

    Fatbin elf code:
    ================
    arch = sm_30
    code version = [1,6]
    producer = cuda
    host = mac
    compile_size = 32bit
    has debug info
    compressed
    identifier = asmprob.cu

        code for sm_30
            Function : _Z10TestKernelPi
        /*0000*/     /*0x10005de428004001*/     MOV R1, c [0x0] [0x44];
        /*0008*/     /*0x00001de218000005*/     MOV32I R0, 0x140;
        /*0010*/     /*0x00001c8614000000*/     LDC R0, c [0x0] [R0];
        /*0018*/     /*0x00001de428000000*/     MOV R0, R0;
        /*0020*/     /*0x00009c8580000000*/     LD R2, [R0];
        /*0028*/     /*0x08009de428000000*/     MOV R2, R2;
        /*0030*/     /*0x1000dc034800c000*/     IADD R3, R0, 0x4;
        /*0038*/     /*0x0c00dde428000000*/     MOV R3, R3;
        /*0040*/     /*0x0030dc8580000000*/     LD R3, [R3];
        /*0048*/     /*0x0c00dde428000000*/     MOV R3, R3;
        /*0050*/     /*0x0c209c0348000000*/     IADD R2, R2, R3;
        /*0058*/     /*0x20001c034800c000*/     IADD R0, R0, 0x8;
        /*0060*/     /*0x00001de428000000*/     MOV R0, R0;
        /*0068*/     /*0x00009c8590000000*/     ST [R0], R2;
        /*0070*/     /*0x40001de740000000*/     BRA 0x88;
        /*0078*/     /*0x00001de780000000*/     EXIT;
        /*0080*/     /*0x00001de780000000*/     EXIT;
        /*0088*/     /*0x00001de780000000*/     EXIT;
        /*0090*/     /*0x00001de780000000*/     EXIT;
        /*0098*/     /*0xe0001de74003ffff*/     BRA 0x98;
        /*00a0*/     /*0x00001de440000000*/     NOP CC.T;
        /*00a8*/     /*0x00001de440000000*/     NOP CC.T;
        /*00b0*/     /*0x00001de440000000*/     NOP CC.T;
        /*00b8*/     /*0x00001de440000000*/     NOP CC.T;
        .................................

Which makes me think that your question is "why doesn't the compiler produce optimal code when I disable optimisations and compile for the debugger?", which is something of a rhetorical question, methinks...
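
One visible difference between the two dumps is the "has debug info" line in the fatbin header of the -G build. As a minimal sketch, assuming cuobjdump prints that header line as it does above, a small check can flag an object file that was built for the debugger before anyone reads its SASS. The function name has_debug_info is hypothetical.

```shell
# has_debug_info (hypothetical helper): exit status 0 if the cuobjdump output
# on stdin contains the "has debug info" header line seen in the -G build
# above, non-zero otherwise.
has_debug_info() {
    grep -q 'has debug info'
}
```

It could be used as cuobjdump -sass asmprob.o | has_debug_info && echo "debug build". The absence of the line only says that no device debug information was embedded. It does not by itself prove that optimisation was enabled.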

EDIT:

And lest there be any doubts that enabling GPU debugging disables compiler optimisation, consider the following output from nvcc:

    $ nvcc -arch=sm_30 -G -c --dryrun asmprob.cu
    #$ _SPACE_=
    #$ _CUDART_=cudart
    #$ _HERE_=/usr/local/cuda/bin
    #$ _THERE_=/usr/local/cuda/bin
    #$ _TARGET_SIZE_=
    #$ TOP=/usr/local/cuda/bin/..
    #$ PATH=/usr/local/cuda/bin/../open64/bin:/usr/local/cuda/bin/../nvvm:/usr/local/cuda/bin:/opt/local/bin:/opt/local/sbin:/Library/Frameworks/Python.framework/Versions/Current/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/local/git/bin:/usr/texbin:/usr/X11/bin:/usr/NX/bin:/usr/local/bin:/Users/talonmies/bin:/usr/local/cuda/bin
    #$ INCLUDES="-I/usr/local/cuda/bin/../include"
    #$ LIBRARIES= "-L/usr/local/cuda/bin/../lib" -lcudart
    #$ CUDAFE_FLAGS=
    #$ OPENCC_FLAGS=
    #$ PTXAS_FLAGS=
    #$ gcc -D__CUDA_ARCH__=300 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../include" -include "cuda_runtime.h" -m32 -malign-double -o "/tmp/tmpxft_00005ceb_00000000-6_asmprob.cpp1.ii" "asmprob.cu"
    #$ cudafe --m32 --gnu_version=40201 -tused --no_remove_unneeded_entities --debug_mode --gen_c_file_name "/tmp/tmpxft_00005ceb_00000000-3_asmprob.cudafe1.c" --stub_file_name "/tmp/tmpxft_00005ceb_00000000-3_asmprob.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00005ceb_00000000-3_asmprob.cudafe1.gpu" --nv_arch "compute_30" --gen_module_id_file --module_id_file_name "/tmp/tmpxft_00005ceb_00000000-2_asmprob.module_id" --include_file_name "tmpxft_00005ceb_00000000-1_asmprob.fatbin.c" "/tmp/tmpxft_00005ceb_00000000-6_asmprob.cpp1.ii"
    #$ gcc -D__CUDA_ARCH__=300 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -D__CUDANVVM__ -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/usr/local/cuda/bin/../include" -m32 -malign-double -o "/tmp/tmpxft_00005ceb_00000000-7_asmprob.cpp2.i" "/tmp/tmpxft_00005ceb_00000000-3_asmprob.cudafe1.gpu"
    #$ cudafe -w --m32 --gnu_version=40201 --c --debug_mode --gen_c_file_name "/tmp/tmpxft_00005ceb_00000000-8_asmprob.cudafe2.c" --stub_file_name "/tmp/tmpxft_00005ceb_00000000-8_asmprob.cudafe2.stub.c" --gen_device_file_name "/tmp/tmpxft_00005ceb_00000000-8_asmprob.cudafe2.gpu" --nv_arch "compute_30" --module_id_file_name "/tmp/tmpxft_00005ceb_00000000-2_asmprob.module_id" --include_file_name "tmpxft_00005ceb_00000000-1_asmprob.fatbin.c" "/tmp/tmpxft_00005ceb_00000000-7_asmprob.cpp2.i"
    #$ gcc -D__CUDA_ARCH__=300 -E -x c -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDABE__ -D__CUDANVVM__ -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/usr/local/cuda/bin/../include" -m32 -malign-double -o "/tmp/tmpxft_00005ceb_00000000-9_asmprob.cpp3.i" "/tmp/tmpxft_00005ceb_00000000-8_asmprob.cudafe2.gpu"
    #$ filehash -s " -g --dont-merge-basicblocks --return-at-end " "/tmp/tmpxft_00005ceb_00000000-9_asmprob.cpp3.i" > "/tmp/tmpxft_00005ceb_00000000-10_asmprob.hash"
    #$ gcc -E -x c++ -D__CUDACC__ -D__NVCC__ "-I/usr/local/cuda/bin/../include" -include "cuda_runtime.h" -m32 -malign-double -o "/tmp/tmpxft_00005ceb_00000000-4_asmprob.cpp4.ii" "asmprob.cu"
    #$ cudafe++ --m32 --gnu_version=40201 --parse_templates --debug_mode --gen_c_file_name "/tmp/tmpxft_00005ceb_00000000-3_asmprob.cudafe1.cpp" --stub_file_name "tmpxft_00005ceb_00000000-3_asmprob.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_00005ceb_00000000-2_asmprob.module_id" "/tmp/tmpxft_00005ceb_00000000-4_asmprob.cpp4.ii"
    #$ cicc -arch compute_30 -m32 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 -g -O0 "/tmp/tmpxft_00005ceb_00000000-11_asmprob" "/tmp/tmpxft_00005ceb_00000000-9_asmprob.cpp3.i" -o "/tmp/tmpxft_00005ceb_00000000-5_asmprob.ptx"
    #$ ptxas -arch=sm_30 -m32 -g --dont-merge-basicblocks --return-at-end "/tmp/tmpxft_00005ceb_00000000-5_asmprob.ptx" -o "/tmp/tmpxft_00005ceb_00000000-12_asmprob.sm_30.cubin"
    #$ fatbinary --create="/tmp/tmpxft_00005ceb_00000000-1_asmprob.fatbin" -32 --key="xxxxxxxxxx" --ident="asmprob.cu" --cmdline=" -g --dont-merge-basicblocks --return-at-end " -g "--image=profile=sm_30,file=/tmp/tmpxft_00005ceb_00000000-12_asmprob.sm_30.cubin" "--image=profile=compute_30,file=/tmp/tmpxft_00005ceb_00000000-5_asmprob.ptx" --embedded-fatbin="/tmp/tmpxft_00005ceb_00000000-1_asmprob.fatbin.c" --cuda
    #$ rm /tmp/tmpxft_00005ceb_00000000-1_asmprob.fatbin
    #$ gcc -D__CUDA_ARCH__=300 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDA_PREC_DIV -D__CUDA_PREC_SQRT "-I/usr/local/cuda/bin/../include" -m32 -malign-double -o "/tmp/tmpxft_00005ceb_00000000-13_asmprob.ii" "/tmp/tmpxft_00005ceb_00000000-3_asmprob.cudafe1.cpp"
    #$ gcc -c -x c++ "-I/usr/local/cuda/bin/../include" -fpreprocessed -m32 -malign-double -o "asmprob.o" "/tmp/tmpxft_00005ceb_00000000-13_asmprob.ii"

Note the device code compilation phase command:

    cicc -arch compute_30 -m32 -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 -g -O0 <----

i.e. debugging builds compile the device code with the optimisation level set to 0.
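
Rather than scanning the whole dry-run listing for that one line, the relevant flags can be filtered out. This is a minimal sketch that assumes the dry-run format shown here, with one command per line prefixed by "#$ " and the device compiler step named cicc. Other toolkit versions may name or format this step differently. The function name cicc_opt_flags is hypothetical.

```shell
# cicc_opt_flags (hypothetical helper): read nvcc --dryrun output on stdin and
# print, one per line, the debug (-g) and optimisation (-O<n>) flags passed to
# the cicc device compilation step. Assumes lines of the form "#$ cicc ...".
cicc_opt_flags() {
    grep -E '^#\$ cicc ' |          # keep only the cicc command line
        tr ' ' '\n' |               # one argument per line
        grep -E '^(-O[0-9]+|-g)$'   # keep -g and -O<n>, drop everything else
}
```

A possible invocation is nvcc -arch=sm_30 -G -c --dryrun asmprob.cu 2>&1 | cicc_opt_flags, where 2>&1 is included in case the listing is written to stderr. Running the same pipeline without -G shows what the default build passes instead.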
