*Modified* Nvidia Maxwell, increased global memory instruction count


Question

I ran the Parboil and Rodinia benchmarks on both a GTX 760 (Kepler) and a GTX 750 Ti (Maxwell), then analyzed the results with the Nvidia Visual Profiler. In most of the applications, the number of global instructions is 7 to 10 times higher on the Maxwell architecture.

Specifications of the two graphics cards:

GTX 760: 6.0 Gbps, 2048 MB, 256-bit, 192.2 GB/s

GTX 750 Ti: 5.4 Gbps, 2048 MB, 128-bit, 86.4 GB/s

Ubuntu 14.04

CUDA driver 340.29

Toolkit 6.5

I compiled the benchmark applications without modification and collected the results from NVVP 6.5. Under Analyze All > Kernel Memory, I took the global load transaction counts from the L1/Shared Memory section.

I attached screenshots of our simulation results for histo run on Kepler (link) and Maxwell (link).

Does anyone know why the global instruction count increases on the Maxwell architecture?

Thanks.

Recommended answer

The gld_transactions counter is not comparable between the Kepler and Maxwell architectures. Furthermore, it is not equivalent to the number of global instructions executed.

On Fermi/Kepler, it counts the number of 128-byte requests from the SM to L1. It can increment by 0 to 32 per global/generic instruction executed.

On Maxwell, global operations all go through the TEX (unified) cache. The TEX cache is completely different from the Fermi/Kepler L1 cache. Global transactions measure the number of 32-byte sectors accessed in the cache. This can increment by 0 to 32 per global/generic instruction executed.

If we look at four cases:

CASE 1: Each thread in a warp accesses the same 32-bit offset.

CASE 2: Each thread in a warp accesses a 32-bit offset with a 128-byte stride.

CASE 3: Each thread in a warp accesses a unique 32-bit offset based upon its lane index.

CASE 4: Each thread in a warp accesses a unique 32-bit offset in a 128-byte memory range that is 128-byte aligned.

gld_transactions for each listed case, by architecture:

            Kepler      Maxwell
Case 1      1           4
Case 2      32          32
Case 3      1           8
Case 4      1           4-16
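To make the Fermi/Kepler definition concrete, here is a minimal Python sketch that generates the byte address touched by each lane for the four cases and counts the distinct 128-byte lines a warp touches. It is only an illustration of that definition, not a model of the hardware counter: the function names, the 128-byte-aligned base address, and the reversed-lane permutation used for Case 4 are assumptions made for the example. The Maxwell column is deliberately not modeled, since the answer does not spell out exactly how that counter arrives at its values.

```python
WARP_SIZE = 32
WORD = 4      # bytes in one 32-bit access
LINE = 128    # bytes per SM-to-L1 request on Fermi/Kepler, per the answer


def warp_addresses(case, base=0):
    """Return the byte address accessed by each of the 32 lanes (hypothetical helper)."""
    if case == 1:   # every lane reads the same 32-bit word
        return [base for _ in range(WARP_SIZE)]
    if case == 2:   # consecutive lanes are 128 bytes apart
        return [base + lane * LINE for lane in range(WARP_SIZE)]
    if case == 3:   # lane i reads word i
        return [base + lane * WORD for lane in range(WARP_SIZE)]
    if case == 4:   # unique words inside one 128-byte range (assumed: reversed order)
        return [base + (WARP_SIZE - 1 - lane) * WORD for lane in range(WARP_SIZE)]
    raise ValueError("case must be 1, 2, 3 or 4")


def distinct_lines(addresses, line=LINE):
    """Count how many distinct line-sized blocks the addresses fall into."""
    return len({addr // line for addr in addresses})


# With a 128-byte-aligned base this matches the Kepler column of the table.
kepler_model = {case: distinct_lines(warp_addresses(case)) for case in (1, 2, 3, 4)}
print(kepler_model)  # {1: 1, 2: 32, 3: 1, 4: 1}
```

Shifting the base off alignment, for example `distinct_lines(warp_addresses(3, base=64))`, makes the same warp straddle two lines, which is one reason a single instruction can contribute anywhere from 0 to 32 to the counter.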

My recommendation is to avoid looking at gld_transactions. A future version of the CUDA profilers should use different metrics that are more actionable and comparable to past architectures.

I would recommend looking at l2_{read, write}_{transactions, throughput}.
