A Method of counting Floating Point Operations in a C++/CUDA Program using PTX

Question

I have a somewhat large CUDA application and I need to calculate the attained GFLOPs. I'm looking for an easy and perhaps generic way of counting the number of floating point operations.

Is it possible to count floating point operations from the generated PTX code (as shown below), using a list of predefined fpo in assembly language? Based on the code, can the counting be made generic? For example, does add.s32 %r58, %r8, -2; count as one floating point operation?

EXAMPLE:

BB3_2:
.loc 2 108 1
mov.u32         %r8, %r79;
setp.ge.s32     %p1, %r78, %r16;
setp.lt.s32     %p2, %r78, 0;
or.pred         %p3, %p2, %p1;
@%p3 bra        BB3_5;

add.s32         %r58, %r8, -2;
setp.lt.s32     %p4, %r58, 0;
setp.ge.s32     %p5, %r58, %r15;
or.pred         %p6, %p4, %p5;
@%p6 bra        BB3_5;

.loc 2 112 1
ld.global.u8    %rc1, [%rd17];
cvt.rn.f32.u8   %f11, %rc1;
mul.wide.u32    %rd12, %r80, 4;
add.s64         %rd13, %rd7, %rd12;
ld.local.f32    %f12, [%rd13];
fma.rn.f32      %f14, %f11, %f12, %f14;
.loc 2 113 1
add.f32         %f15, %f15, %f12;

Or are there far simpler ways of counting FPOs and this is a waste of time?

Recommended answer

The easiest way to count FLOPS would be to have the CUDA profiler do it for you. By selecting the Achieved FLOPS experiment, you can get charts like this:

The Floating Point Operations chart displays a count of each type of floating point operation executed by your kernel.
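
If a static count from the PTX listing is still wanted, it can be roughly sketched as below. This is only an illustration under stated assumptions, not the profiler's method: the function name `count_static_flops`, the opcode list, and the weights are all hypothetical choices, and counting an fma as two operations is a common convention rather than something the question specifies. An instruction is counted only when its base opcode is arithmetic and it carries a floating-point type suffix, so add.s32 (a signed 32-bit integer add) is not counted, and neither are cvt or ld instructions that merely mention .f32. The result is the number of instructions in the listing, not the number executed at run time (loops and branches are ignored), and the final machine code may differ from the PTX.

```python
from collections import Counter

# Assumed weights: operations credited per instruction (fma/mad = multiply + add).
FLOP_WEIGHTS = {"add": 1, "sub": 1, "mul": 1, "div": 1, "fma": 2, "mad": 2}
# Type suffixes treated as floating point in this sketch.
FP_TYPES = {"f16", "f32", "f64"}

def count_static_flops(ptx_text):
    """Return (per-opcode instruction counts, weighted total) for a PTX listing."""
    counts = Counter()
    for raw in ptx_text.splitlines():
        line = raw.strip()
        # Skip blanks, directives such as ".loc", comments, and labels such as "BB3_2:".
        if not line or line.startswith((".", "//")) or line.endswith(":"):
            continue
        tokens = line.split()
        if tokens[0].startswith("@"):      # drop a guard predicate like "@%p3"
            tokens = tokens[1:]
        if not tokens:
            continue
        base, *modifiers = tokens[0].split(".")
        if base in FLOP_WEIGHTS and FP_TYPES.intersection(modifiers):
            counts[tokens[0]] += 1
    total = sum(FLOP_WEIGHTS[op.split(".")[0]] * n for op, n in counts.items())
    return counts, total

# A few lines taken from the listing in the question.
SAMPLE = """
BB3_2:
.loc 2 108 1
add.s32         %r58, %r8, -2;
@%p6 bra        BB3_5;
cvt.rn.f32.u8   %f11, %rc1;
mul.wide.u32    %rd12, %r80, 4;
ld.local.f32    %f12, [%rd13];
fma.rn.f32      %f14, %f11, %f12, %f14;
add.f32         %f15, %f15, %f12;
"""

if __name__ == "__main__":
    counts, total = count_static_flops(SAMPLE)
    print(dict(counts))   # {'fma.rn.f32': 1, 'add.f32': 1}
    print(total)          # 3
```

To turn such a count into attained GFLOPs, it would still have to be multiplied by how many times each instruction actually executes across all threads and divided by the measured kernel time, which is why the profiler's executed-operation counts are the more dependable route.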
