CUDA kernel for Finite Element Assembly


Question

We have an unstructured tetrahedral mesh file in the following format:

    element-ID  nod1  nod2  nod3  nod4
    1            452  3434   322  9000
    2           2322   837  6673  2323
    .
    .
    .
    300000

We partitioned the mesh above into partitions of 2048 elements each. Each partition of 2048 elements contains unique nod1, nod2, nod3 and nod4 values, and for each partition we launch 1 block of 512 threads at different start indices.

In a CUDA file, we have

__global__ void calc(double d_ax,int *nod1,int *node2,int *nod3,int *nod4,int   start,int size)
{
    int n1,n2,n3,n4;     
    int i = blockIdx.x * blockDim.x + threadIdx.x + start;


    if ( i < size )
    {

        n1=nod1[i];
        n2=nod2[i];
        n3=nod3[i];
        n4=nod4[i];

        ax[n1] += some code;
        ax[n2] += some code;
        ax[n3] += some code;
        ax[n4] += some code;
    }
}

We call the kernel as follows:

calc<<<1,512>>>(d_ax,....,0,512);
calc<<<1,512>>>(d_ax,....,512,512);
calc<<<1,512>>>(d_ax,....,1024,512); 
calc<<<1,512>>>(d_ax,....1536,512);

The code above works well, but we get different results when we use more than one block at a time. For example:

calc<<<2,512>>>(d_ax,....,0,1024); 
calc<<<2,512>>>(d_ax,....,1024,1024); 

Can anyone help me?

Recommended answer

I am not sure how you expect anyone to tell you what is wrong when the code you have posted is incomplete and does not compile. But if, in your single-block case, you really are calling the kernel as posted, this is what should happen:

calc<<<1,512>>>(d_ax,....,0,512);    // process first 512 elements
calc<<<1,512>>>(d_ax,....,512,512);  // start >= 512, size == 512, does nothing
calc<<<1,512>>>(d_ax,....,1024,512); // start >= 1024, size == 512, does nothing
calc<<<1,512>>>(d_ax,....1536,512);  // start >= 1536, size == 512, does nothing

The kernel compares the global index i, which already includes start, against size, so every thread fails the bounds check in any launch where start >= size. So irrespective of whether your code is broken when run with multiple blocks, your results for the single-block case are probably wrong, and the premise of your question is probably moot as a result.
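The bounds-check arithmetic can be verified on the host without a GPU. The following is a minimal C++ sketch, not the poster's code: the function name active_threads and its size_is_count flag are hypothetical, introduced here for illustration, and the alternative guard i < start + size is only an assumption about what the poster may have intended (that size is a per-launch element count rather than an end index).

```cpp
// Host-side emulation of the index arithmetic in calc (no GPU required).
// Returns how many threads of a <<<blocks, threads>>> launch pass the guard.
//   size_is_count == false : guard as posted,              i < size
//   size_is_count == true  : assumed intended guard,       i < start + size
int active_threads(int blocks, int threads, int start, int size,
                   bool size_is_count) {
    const int limit = size_is_count ? start + size : size;
    int active = 0;
    for (int b = 0; b < blocks; ++b) {          // plays the role of blockIdx.x
        for (int t = 0; t < threads; ++t) {     // plays the role of threadIdx.x
            const int i = b * threads + t + start;  // same formula as in calc
            if (i < limit) {
                ++active;                       // this thread would update ax
            }
        }
    }
    return active;
}
```

With the guard as posted, active_threads(1, 512, 0, 512, false) is 512 and the other three single-block launches give 0, so the four launches together touch 512 elements. The two-block launches give active_threads(2, 512, 0, 1024, false) = 1024 and active_threads(2, 512, 1024, 1024, false) = 0, so they touch 1024 elements. That difference alone would be enough to produce different results. Under the assumed guard (size_is_count set to true), both launch patterns cover all 2048 elements.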

If you want a better answer, edit your question so that it contains a complete description of the problem and concise, complete code that actually compiles. Otherwise, this is about as much as anybody can guess from the information you have provided.
