Tiling performance in release build is worse than that in debug build


Question

I'm examining the performance gained by using tile_static memory for matrix multiplication. The kernel is identical to the one in the downloaded Matrix Multiplication Sample. To measure time, I'm using the exact code found in Chapter 8 of Kate Gregory's book C++ AMP: Accelerated Massive Parallelism with Microsoft® Visual C++, which uses the QueryPerformanceCounter function.

In my code, both matrices are 1024 x 1024 and the tile size is 16 x 16. I expected the tiled version to perform better than the simple version, but I was surprised by the results on my computer. My GPU is an AMD Radeon HD 6750M.
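For readers unfamiliar with the two variants being compared, here is a minimal CPU-side sketch in plain C++. It is not the C++ AMP kernel from the sample: the function names multiply_simple and multiply_tiled are hypothetical, and the small local arrays only stand in for what tile_static memory does on the GPU, namely staging a 16 x 16 block once so that it can be reused many times.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tile edge length; 16 matches the tile size mentioned in the question.
constexpr std::size_t TS = 16;

// Simple version: each output element reads a full row of a and a full
// column of b directly from the large arrays.
std::vector<float> multiply_simple(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   std::size_t n) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
    return c;
}

// Tiled version: for each TS x TS output tile, copy one tile of a and one
// tile of b into small local buffers (the stand-in for tile_static), then
// accumulate partial products from those buffers only.
std::vector<float> multiply_tiled(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t n) {
    assert(n % TS == 0);  // assumption: n is a multiple of the tile size
    std::vector<float> c(n * n, 0.0f);
    float tile_a[TS][TS];
    float tile_b[TS][TS];
    for (std::size_t tr = 0; tr < n; tr += TS)
        for (std::size_t tc = 0; tc < n; tc += TS)
            for (std::size_t tk = 0; tk < n; tk += TS) {
                // Stage the two input tiles.
                for (std::size_t i = 0; i < TS; ++i)
                    for (std::size_t j = 0; j < TS; ++j) {
                        tile_a[i][j] = a[(tr + i) * n + (tk + j)];
                        tile_b[i][j] = b[(tk + i) * n + (tc + j)];
                    }
                // Accumulate this tile pair's contribution to the output tile.
                for (std::size_t i = 0; i < TS; ++i)
                    for (std::size_t j = 0; j < TS; ++j) {
                        float sum = 0.0f;
                        for (std::size_t k = 0; k < TS; ++k)
                            sum += tile_a[i][k] * tile_b[k][j];
                        c[(tr + i) * n + (tc + j)] += sum;
                    }
            }
    return c;
}
```

Both functions take row-major n x n matrices and should produce the same product. On a GPU the benefit of tiling depends on how the compiler and driver treat the staged loads and barriers, which is the kind of thing that can differ between debug and release builds.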

When I built the code in debug and ran it from cmd, the recorded times were:

  • Simple version: 637.999 ms
  • Tiled version: 125.559 ms

And when I built the code in release and ran it from cmd, the recorded times were:

  • Simple version: 166.39 ms
  • Tiled version: 204.539 ms

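I found that:

  1. The simple version performs better in release than in debug
  2. The tiled version performs worse in release than in debug
  3. In debug, the tiled version performs better than the simple version
  4. In release, the tiled version performs worse than the simple version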

I've made sure that runtime initialization time and JIT time were not included in the measurement, and that the kernel did run on the GPU named above. I've also tried different matrix sizes and tile sizes, but observed similar results. I don't understand items 2 and 4 in the list above.
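One common way to keep initialization and JIT cost out of a measurement is to run the kernel once untimed and only then start the clock. The sketch below illustrates that pattern. It is not the code from the book: time_after_warmup_ms is a hypothetical helper, std::chrono::steady_clock is used as a portable stand-in for QueryPerformanceCounter, and run_kernel is assumed to block until the GPU results are actually available (in C++ AMP that would mean synchronizing or waiting inside it, since kernel launches are asynchronous).

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: returns the average wall-clock time in milliseconds
// of one call to run_kernel, after one untimed warm-up call.
// Assumptions: timed_runs > 0, and run_kernel does not return until its
// results are ready, otherwise only the launch overhead would be measured.
double time_after_warmup_ms(const std::function<void()>& run_kernel,
                            int timed_runs) {
    // Warm-up: absorbs one-time costs such as runtime initialization and
    // JIT compilation so that they are not part of the timed region.
    run_kernel();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i)
        run_kernel();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timed_runs;
}
```

Averaging over several timed runs also helps show whether a difference of the size reported here (about 166 ms versus 205 ms) is stable or within run-to-run noise.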

So my question is: what is going on here? Any explanation or similar experience? Thanks in advance.

Recommended answer

I'm very much looking forward to seeing the answer to this!

-L

