Making some, but not all, (CUDA) memory accesses uncached

Published: 2020/9/28 6:25:55 · Tags: caching, cuda, gpgpu

I just noticed that it is possible to make (CUDA kernel) memory accesses uncached (see e.g. this answer here on SO).

Can this be done...

Only if you compile that kernel individually, because this is an instruction-level feature that is enabled by code generation. You could also use inline PTX assembly to issue ld.global.cg instructions for a particular load operation within a kernel [see here for details].
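
To illustrate the inline PTX route, here is a minimal sketch, not taken from the linked answer. The names `load_cg`, `scale` and `scale_on_device` are hypothetical, and the code assumes a 64-bit build in which the pointer passed to `load_cg` refers to global memory. Only the one load that goes through `load_cg` uses the `.cg` cache operator; every other access in the kernel keeps the compiler's default behaviour.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: load one float with the .cg cache operator
// (cache in L2, bypass L1). Assumes ptr points into global memory.
__device__ __forceinline__ float load_cg(const float* ptr)
{
    float value;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(value) : "l"(ptr));
    return value;
}

// Only the read of in[i] is issued as an L1-bypassing load;
// the store to out[i] is an ordinary store.
__global__ void scale(const float* in, float* out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = factor * load_cg(in + i);
    }
}

// Host-side wrapper (hypothetical): copies data in, runs the kernel,
// copies the result back. Error checking is omitted for brevity.
std::vector<float> scale_on_device(const std::vector<float>& host_in, float factor)
{
    const int n = static_cast<int>(host_in.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> host_out(n, 0.0f);

    float* dev_in = nullptr;
    float* dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev_in, dev_out, n, factor);

    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out;
}

int main()
{
    std::vector<float> input(1024, 1.5f);
    std::vector<float> output = scale_on_device(input, 2.0f);
    printf("out[0] = %f\n", output[0]);
    return 0;
}
```

Wrapping the instruction in a small device function keeps the choice local: a kernel can call `load_cg` for the accesses it wants to bypass L1 and use plain dereferences everywhere else.
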
No, it is an instruction-level feature of PTX. You can JIT a version of the code containing non-caching memory loads at runtime, but that is still technically compilation. You could probably use some template tricks and separate compilation to get the runtime to hold two versions of the same code, built with or without caching, and choose between those versions at runtime. You could also use the same tricks to get two versions of a given kernel, with or without inline PTX for uncached loads [see here for one possibility of achieving this].
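
One way the template trick could look is sketched below. This is an assumption-laden illustration rather than the approach from the linked answer: the names `load`, `add_one`, `launch_add_one` and `add_one_on_device` are hypothetical, both variants live in a single translation unit instead of using separate compilation, and the pointer is assumed to refer to global memory. Both kernel variants are generated at compile time; the host merely picks which one to launch.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Load policy fixed at compile time by the template parameter.
template <bool BypassL1>
__device__ __forceinline__ float load(const float* ptr)
{
    if (BypassL1) {
        float value;
        // .cg: cache in L2, bypass L1 (ptr assumed to be a global address)
        asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(value) : "l"(ptr));
        return value;
    }
    return *ptr;  // ordinary load, caching follows the compiler default
}

// Two instantiations of this kernel end up in the binary.
template <bool BypassL1>
__global__ void add_one(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = load<BypassL1>(in + i) + 1.0f;
    }
}

// Runtime choice between the two precompiled variants.
void launch_add_one(const float* in, float* out, int n, bool bypass_l1)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    if (bypass_l1) {
        add_one<true><<<blocks, threads>>>(in, out, n);
    } else {
        add_one<false><<<blocks, threads>>>(in, out, n);
    }
}

// Host-side wrapper (hypothetical); error checking omitted for brevity.
std::vector<float> add_one_on_device(const std::vector<float>& host_in, bool bypass_l1)
{
    const int n = static_cast<int>(host_in.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> host_out(n, 0.0f);

    float* dev_in = nullptr;
    float* dev_out = nullptr;
    cudaMalloc(&dev_in, bytes);
    cudaMalloc(&dev_out, bytes);
    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    launch_add_one(dev_in, dev_out, n, bypass_l1);

    cudaMemcpy(host_out.data(), dev_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return host_out;
}

int main()
{
    std::vector<float> input(1024, 2.0f);
    std::vector<float> cached = add_one_on_device(input, false);
    std::vector<float> uncached = add_one_on_device(input, true);
    printf("results match: %s\n", cached == uncached ? "yes" : "no");
    return 0;
}
```

The two variants compute the same result; they differ only in the load instruction emitted for the read of `in[i]`, which is what makes a runtime switch between them safe.
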
These non-caching instructions bypass the L1 cache and go to the L2 cache with byte-level granularity, so they apply to loads only (all writes invalidate the L1 cache and store to L2).
