Efficient method to check for matrix stability in CUDA


Question


A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:

  • Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
  • Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
  • Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).

Are any of the three options above, or an alternative, considered best practice and/or generally more efficient?

Solution

I'll also go back to the answer I would have posted in 2012 but for a browser crash.

The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.

This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.

A complete working example (with C++ host code; I don't have access to a working PyCUDA installation at the moment) looks like this:

#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>

__device__ unsigned int process(int & val)
{
    return (++val < 10);
}

template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
    __shared__ int wchanged[nwarps];
    unsigned int laneid = threadIdx.x % warpSize;
    unsigned int warpid = threadIdx.x / warpSize;

    // Do calculations then check for change/convergence 
    // and set tchanged to be !=0 if required
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tchanged = process(inout[idx]);

    // Simple blockwise reduction using voting primitives:
    // increment kchanged if any thread in the block
    // returned tchanged != 0
    tchanged = __any(tchanged != 0); // use __any_sync(0xffffffff, ...) on CUDA 9+
    if (laneid == 0) {
        wchanged[warpid] = tchanged;
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        int bchanged = 0;
#pragma unroll
        for(int i=0; i<nwarps; i++) {
            bchanged |= wchanged[i];
        }
        if (bchanged) {
            atomicAdd(kchanged, 1);
        }
    }
}

int main(void)
{
    const int N = 2048;
    const int min = 5, max = 15;
    std::vector<int> data(N);
    for(int i=0; i<N; i++) {
        data[i] = min + (std::rand() % (int)(max - min + 1));
    }

    int* _data;
    size_t datasz = sizeof(int) * (size_t)N;
    cudaMalloc<int>(&_data, datasz);
    cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);

    unsigned int *kchanged, *_kchanged;
    cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);

    const int nwarps = 4;
    dim3 blcksz(32*nwarps), grdsz(16);

    // Loop while the kernel signals it needs to run again
    do {
        *kchanged = 0;
        kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
        cudaDeviceSynchronize(); 
    } while (*kchanged != 0); 

    cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
    cudaDeviceReset();

    int minval = *std::min_element(data.begin(), data.end());
    assert(minval == 10);

    return 0;
}

Here, kchanged is the flag the kernel uses to signal to the host that it needs to run again. The kernel runs until each entry in the input has been incremented above a threshold value. At the end of each thread's processing, it participates in a warp vote, after which one thread from each warp writes the vote result to shared memory. One thread reduces the warp results and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then read the result directly from the mapped host variable.

You should be able to adapt this to whatever your application requires.
