A Method of Counting Floating Point Operations in a C++/CUDA Program Using PTX
Question
I have a fairly large CUDA application and I need to calculate the GFLOPS it attains. I'm looking for an easy and ideally generic way of counting the number of floating point operations.
Is it possible to count floating point operations (FPOs) from the generated PTX code (as shown below), using a predefined list of the floating point instructions in the assembly language? Based on the code, can the counting be made generic? For example, does add.s32 %r58, %r8, -2; count as one floating point operation?
EXAMPLE:
BB3_2:
.loc 2 108 1
mov.u32 %r8, %r79;
setp.ge.s32 %p1, %r78, %r16;
setp.lt.s32 %p2, %r78, 0;
or.pred %p3, %p2, %p1;
@%p3 bra BB3_5;
add.s32 %r58, %r8, -2;
setp.lt.s32 %p4, %r58, 0;
setp.ge.s32 %p5, %r58, %r15;
or.pred %p6, %p4, %p5;
@%p6 bra BB3_5;
.loc 2 112 1
ld.global.u8 %rc1, [%rd17];
cvt.rn.f32.u8 %f11, %rc1;
mul.wide.u32 %rd12, %r80, 4;
add.s64 %rd13, %rd7, %rd12;
ld.local.f32 %f12, [%rd13];
fma.rn.f32 %f14, %f11, %f12, %f14;
.loc 2 113 1
add.f32 %f15, %f15, %f12;
Or are there far simpler ways of counting FPOs, and is this a waste of time?
Recommended answer
The easiest way to count FLOPS would be to have the CUDA profiler do it for you. By selecting the Achieved FLOPS experiment, you can get charts of the results.
The Floating Point Operations chart displays a count of each type of floating point operation executed by your kernel.
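For the static approach the question asks about, a rough sketch in Python might look like the following. It scans PTX text and credits only arithmetic instructions whose type suffix is .f32 or .f64, so add.s32 (an integer add) is not counted. The function name count_static_flops, the weight table, and the choice to credit fma as two operations are assumptions made for illustration, not part of any NVIDIA tool. Note that this counts instructions as they appear in the listing, not how often they execute: loops and branches mean the number actually executed by a kernel can differ widely, which is why the profiler remains the more reliable route.

```python
import re
from collections import Counter

# Assumed weights: FLOPs credited per instruction.
# Crediting fma/mad as 2 (multiply + add) is a common convention, not a rule.
FLOP_WEIGHTS = {"add": 1, "sub": 1, "mul": 1, "div": 1, "fma": 2, "mad": 2}

# Optional predicate guard such as "@%p3", followed by the dotted opcode.
# Labels ("BB3_2:") and directives (".loc") do not match.
INSTR_RE = re.compile(r"^\s*(?:@!?%\w+\s+)?([a-z][\w.]*)")


def count_static_flops(ptx_text):
    """Count floating point arithmetic instructions appearing in PTX text.

    This is a static count of the listing, not of executed instructions.
    """
    counts = Counter()
    for line in ptx_text.splitlines():
        match = INSTR_RE.match(line)
        if not match:
            continue
        parts = match.group(1).split(".")
        base, suffixes = parts[0], parts[1:]
        # Only count arithmetic opcodes whose operand type is floating point.
        if base in FLOP_WEIGHTS and suffixes and suffixes[-1] in ("f32", "f64"):
            counts[base + "." + suffixes[-1]] += FLOP_WEIGHTS[base]
    return counts


# Excerpt taken from the PTX in the question.
sample = """
BB3_2:
.loc 2 108 1
add.s32 %r58, %r8, -2;
@%p6 bra BB3_5;
cvt.rn.f32.u8 %f11, %rc1;
mul.wide.u32 %rd12, %r80, 4;
ld.local.f32 %f12, [%rd13];
fma.rn.f32 %f14, %f11, %f12, %f14;
add.f32 %f15, %f15, %f12;
"""

counts = count_static_flops(sample)
print(dict(counts), "total:", sum(counts.values()))
```

On this excerpt only fma.rn.f32 and add.f32 are credited; the integer, conversion, load and branch instructions are skipped. Whether conversions such as cvt.rn.f32.u8 should count as floating point work is a judgment call, and the weight table would need extending for other instructions (for example sqrt or rcp) before the result means much.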