Don't understand why column addition is faster than row addition in CUDA


Problem Description

I started with CUDA and wrote two kernels for an experiment. They both accept 3 pointers to arrays of n*n floats (matrix emulation) and n.

__global__
void th_single_row_add(float* a, float* b, float* c, int n) {
  // One thread per row: the thread's global id times n is the row start.
  int idx = blockDim.x * blockIdx.x * n + threadIdx.x * n;
  for (int i = 0; i < n; i++) {
    if (idx + i >= n*n) return;
    c[idx + i] = a[idx + i] + b[idx + i];
  }
}

__global__
void th_single_col_add(float* a, float* b, float* c, int n) {
  // One thread per column: consecutive threads start at consecutive elements.
  int idx = blockDim.x * blockIdx.x + threadIdx.x;
  for (int i = 0; i < n; i++) {
    int idx2 = idx + i * n;
    if (idx2 >= n*n) return;
    c[idx2] = a[idx2] + b[idx2];
  }
}

In th_single_row_add each thread sums a row of n elements; in th_single_col_add each thread sums a column. Here is the profile for n = 1000 (1,000,000 elements):

986.29us  th_single_row_add(float*, float*, float*, int)
372.96us  th_single_col_add(float*, float*, float*, int)
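
The launch configuration isn't shown in the question. For reference, a minimal host-side sketch that would give one thread per row (or per column); the block size of 256 and the allocation scaffolding are my assumptions, not from the question:

#include <cuda_runtime.h>

int main(void) {
  int n = 1000;
  size_t bytes = (size_t)n * n * sizeof(float);
  float *a, *b, *c;
  cudaMalloc(&a, bytes);
  cudaMalloc(&b, bytes);
  cudaMalloc(&c, bytes);

  // One thread per row (or column): n threads in total.
  int threads = 256;                         // assumed block size
  int blocks = (n + threads - 1) / threads;
  th_single_row_add<<<blocks, threads>>>(a, b, c, n);
  th_single_col_add<<<blocks, threads>>>(a, b, c, n);
  cudaDeviceSynchronize();

  cudaFree(a); cudaFree(b); cudaFree(c);
  return 0;
}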

As you see, the column sum is three times faster. I thought that because in the column variant all the indexes in the loop are far from each other, it should be slower. Where am I wrong?

Recommended Answer

Threads in CUDA don't act individually, they are grouped together in warps of 32 threads. Those 32 threads execute in lockstep (usually). An instruction issued to one thread is issued to all 32 at the same time, in the same clock cycle.
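
As a tiny illustration of that grouping (my sketch, not part of the answer), a thread's warp and its lane within that warp follow directly from threadIdx.x:

#include <cstdio>

__global__
void show_warps(void) {
  // Each group of 32 consecutive threads within a block forms one warp.
  int lane = threadIdx.x % 32;  // position inside the warp
  int warp = threadIdx.x / 32;  // warp index inside the block
  if (lane == 0)
    printf("block %d: warp %d starts at thread %d\n",
           blockIdx.x, warp, threadIdx.x);
}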

If that instruction is one that reads memory, for example, then up to 32 independent reads may be required/requested. The exact pattern of addresses needed to satisfy these read operations is determined by the code you write. If those addresses are all "adjacent" in memory, that will be an efficient read. If those addresses are somehow "scattered" in memory, that will be an inefficient read, and will be slower.

This basic concept just described is called "coalesced" access in CUDA. Your column-summing case allows for coalesced access across a warp, because at each loop iteration the addresses generated by the threads of a warp fall in adjacent columns, and those locations are adjacent in memory. Your row-summing case breaks this: the addresses generated by the threads of a warp are not adjacent (they are "columnar", separated from each other by the width of your array) and are therefore not "coalesced".
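
To make the two patterns concrete (a worked example I added, following the idx formulas above): at loop iteration i = 0, thread t of the first warp reads a[t * n] in the row kernel but a[t] in the column kernel. A host-side check:

#include <cstdio>

int main(void) {
  int n = 1000;
  // Element read at loop iteration i = 0 by threads 0..3 of warp 0.
  for (int t = 0; t < 4; t++) {
    printf("thread %d: row kernel reads a[%d], col kernel reads a[%d]\n",
           t, t * n, t);
  }
  return 0;
}

So per iteration the column kernel's warp loads 32 adjacent floats, which the hardware can service with a minimal number of transactions, while the row kernel's warp loads 32 floats spaced n * sizeof(float) bytes apart, each landing in a different cache line.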

The difference in performance is due to this difference in memory access efficiency.
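
Both kernels compute the same elementwise sum, so if you want the row-style code path to match the column version's speed, a common pattern (a sketch I added, not part of the answer) is one element per thread with a grid-stride loop, which is coalesced by construction:

__global__
void th_elementwise_add(float* a, float* b, float* c, int n) {
  // Consecutive threads touch consecutive elements, so every warp's
  // 32 loads fall in adjacent locations and coalesce.
  for (int idx = blockDim.x * blockIdx.x + threadIdx.x;
       idx < n * n;
       idx += gridDim.x * blockDim.x) {
    c[idx] = a[idx] + b[idx];
  }
}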

You can study more about coalescing behavior in CUDA in an introductory treatment of CUDA optimization, such as here, especially slides 44-54.
