CUDA program not working for more than 1024 threads

Question

My program implements odd-even merge sort, and it does not work for more than 1024 threads.

I have already tried increasing the block size to 100, but it still does not work for more than 1024 threads.

I'm using Visual Studio 2012 and I have an Nvidia GeForce 610M. This is my program:

#include<stdio.h>
#include<iostream>
#include<conio.h>
#include <random>
#include <stdint.h>
#include <driver_types.h >


__global__ void odd(int *arr,int n){
    int i=threadIdx.x;
    int temp;
    if(i%2==1&&i<n-1){
        if(arr[i]>arr[i+1])
        {
            temp=arr[i];
            arr[i]=arr[i+1];
            arr[i+1]=temp;
        }
    }
}

__global__ void even(int *arr,int n){
    int i=threadIdx.x;
    int temp;
    if(i%2==0&&i<n-1){
        if(arr[i]>arr[i+1])
        {
            temp=arr[i];
            arr[i]=arr[i+1];
            arr[i+1]=temp;
        }
    }
}

int main(){
    int SIZE,k,*A,p,j;
    int *d_A;
    float time;

    printf("Enter the size of the array\n");
    scanf("%d",&SIZE);
    A=(int *)malloc(SIZE*sizeof(int));
    cudaMalloc(&d_A,SIZE*sizeof(int));
    for(k=0;k<SIZE;k++)
    A[k]=rand()%1000;


    cudaMemcpy(d_A,A,SIZE*sizeof(int),cudaMemcpyHostToDevice);
    if(SIZE%2==0)
        p=SIZE/2;
    else
        p=SIZE/2+1;


    for(j=0;j<p;j++){
        even<<<3,SIZE>>>(d_A,SIZE);
        if(j!=p-1)
            odd<<<3,SIZE>>>(d_A,SIZE);
        if(j==p-1&&SIZE%2==0)
            odd<<<1,SIZE>>>(d_A,SIZE);
    }


    cudaMemcpy(A,d_A,SIZE*sizeof(int),cudaMemcpyDeviceToHost);
    for(k=0;k<SIZE;k++)
        printf("%d ",A[k]);


    free(A);
    cudaFree(d_A);

    getch();

} 


Recommended answer

CUDA threadblocks are limited to 1024 threads (or 512 threads for cc 1.x GPUs). The size of the threadblock is given by the second kernel configuration parameter in the kernel launch:

    even<<<3,SIZE>>>(d_A,SIZE);
             ^^^^

So when you enter a SIZE value greater than 1024, this kernel will not launch.
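
One possible way around the limit, sketched here under the assumption that each thread should still handle one array element, is to keep the block size fixed and launch enough blocks to cover the array, with each thread computing a global index from `blockIdx.x`, `blockDim.x`, and `threadIdx.x`. The names `THREADS_PER_BLOCK`, `oddEvenPhase`, and `oddEvenSort` are hypothetical and not part of the original program; the two original kernels are folded into one with a `parity` argument for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical block size; any value up to the device limit
// (1024 on cc 2.x and later GPUs) should work.
constexpr int THREADS_PER_BLOCK = 256;

// One phase of odd-even transposition sort.
// parity == 0 compares pairs (0,1), (2,3), ...; parity == 1 compares (1,2), (3,4), ...
__global__ void oddEvenPhase(int *arr, int n, int parity) {
    // Global index across all blocks, not just threadIdx.x.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == parity && i < n - 1) {
        if (arr[i] > arr[i + 1]) {
            int temp = arr[i];
            arr[i] = arr[i + 1];
            arr[i + 1] = temp;
        }
    }
}

// Sorts n ints that already reside in device memory.
// Returns the first CUDA error observed, or cudaSuccess.
cudaError_t oddEvenSort(int *d_arr, int n) {
    if (n < 2) return cudaSuccess;
    // Round up so that blocks * THREADS_PER_BLOCK >= n.
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    // n alternating even/odd phases are enough to sort n elements.
    for (int phase = 0; phase < n; ++phase) {
        oddEvenPhase<<<blocks, THREADS_PER_BLOCK>>>(d_arr, n, phase % 2);
    }
    cudaError_t err = cudaGetLastError();   // launch-time errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // execution-time errors
}
```

In the original `main`, the loop of `even` and `odd` launches could then be replaced by a single call such as `oddEvenSort(d_A, SIZE)`. Launches issued to the same stream run in order, and within one phase every thread touches a distinct pair, so the phases do not interfere with each other.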

You're getting no indication of this because you're not doing proper CUDA error checking, which is always a good idea any time you're having trouble with CUDA code. You can also, as a quick test, run your code with cuda-memcheck to look for API errors.
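
As a rough illustration of what such error checking can look like, here is a minimal sketch. The macro name `CUDA_CHECK`, the empty kernel `noop`, and the helper `tryLaunch` are hypothetical names chosen for this example; only `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are CUDA runtime API calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: print a message and exit if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Empty kernel used only to exercise the launch configuration.
__global__ void noop() {}

// Launches noop with one block of threadsPerBlock threads and
// returns the resulting status instead of exiting.
cudaError_t tryLaunch(int threadsPerBlock) {
    noop<<<1, threadsPerBlock>>>();
    // A kernel launch returns nothing, so ask the runtime for the launch status.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    // Wait for the kernel to finish to surface errors raised during execution.
    return cudaDeviceSynchronize();
}
```

With this in place, `tryLaunch(256)` is expected to succeed, while `tryLaunch(2048)` should return an error because the block size exceeds the per-block limit. In a full program, API calls can be wrapped as `CUDA_CHECK(cudaMalloc(&d_A, SIZE*sizeof(int)))`, and each kernel launch can be followed by `CUDA_CHECK(cudaGetLastError())`.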
