Sharing roots and weights for many Gauss-Legendre Quadrature in GPUs


Problem description

I am intending to compute, in a parallel fashion, a lot of numerical quadratures that at the end of the day use a common set of data for all the computations (quite big arrays of roots and weights occupying about 25 KB of memory). The Gauss-Legendre quadrature method is simple enough to start with. I want to make the roots and weights available to all the threads in the device through the declaration __device__ double *d_droot, *d_dweight;. But I am missing something, because I have to pass the pointers to the arrays explicitly for my kernel to work. How can I do it properly? Even more, aiming to have more free memory available on the device, is it possible to burn the roots and weights into some constant portion of the device's memory?

The code is attached:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>


__device__  double *d_droot, *d_dweight;


__device__ __host__
double f(double alpha,double x)
{
  /*function to be integrated via gauss-legendre quadrature. */
  return exp(alpha*x);
}

__global__
void lege_inte2(int n, double alpha, double a, double b, double *lroots, double *weight, double *result)
{
  /*
    Parameters:
    n: Total number of quadratures
    a: Lower integration limit
    b: Upper integration limit
    lroots[]: roots for the quadrature
    weight[]: weights for the quadrature
    result[]: holds the results of the N quadratures.
   */
  double c1 = (b - a) / 2, c2 = (b + a) / 2, sum = 0;
  int dummy;

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    {
      result[i] = 0.0;
      for (dummy = 0; dummy < 5; dummy++)
    result[i] += weight[dummy] * f(alpha,c1 * lroots[dummy] + c2)*c1;
    }
}

__global__
void lege_inte2_shared(int n,double alpha, double a, double b,  double *result)
{
  extern __shared__ double *d_droot;
  extern __shared__ double *d_dweight;
  /*
    Parameters:
    n: Total number of quadratures
    a: Lower integration limit
    b: Upper integration limit
    d_droot[]: roots for the quadrature
    d_dweight[]: weights for the quadrature
    result[]: holds the results of the N quadratures.
   */
  double c1 = (b - a) / 2, c2 = (b + a) / 2, sum = 0;
  int dummy;

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    {
      result[i] = 0.0;
      for (dummy = 0; dummy < 5; dummy++)
    {
      result[i] += d_dweight[dummy] * f(alpha,c1 * d_droot[dummy] + c2)*c1;
      printf(" Vale: %f \n", d_dweight[dummy]);
    }
    }
}


int main(void)
{
  int N = 1<<23;
  int N_nodes = 5;


  double *droot, *dweight, *dresult, *d_dresult;


  /*double version in host*/
  droot =(double*)malloc(N_nodes*sizeof(double));
  dweight =(double*)malloc(N_nodes*sizeof(double));
  dresult =(double*)malloc(N*sizeof(double)); /*will receive the results of N quadratures!*/


  /*double version in device*/
  cudaMalloc(&d_droot, N_nodes*sizeof(double));
  cudaMalloc(&d_dweight, N_nodes*sizeof(double));
  cudaMalloc(&d_dresult, N*sizeof(double)); /*results for N quadratures will be contained here*/


  /*double version of the roots and weights*/
  droot[0] = 0.90618;
  droot[1] = 0.538469;
  droot[2] = 0.0;
  droot[3] = -0.538469;
  droot[4] = -0.90618;


  dweight[0] = 0.236927;
  dweight[1] = 0.478629;
  dweight[2] = 0.568889;
  dweight[3] = 0.478629;
  dweight[4] = 0.236927;



  /*double copy host-> device*/
  cudaMemcpy(d_droot, droot, N_nodes*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_dweight, dweight, N_nodes*sizeof(double), cudaMemcpyHostToDevice);


  // Launch one quadrature per thread

  lege_inte2<<<(N+255)/256, 256>>>(N,1.0,  -3.0, 3.0, d_droot, d_dweight, d_dresult); /*This kernel works OK*/
  //lege_inte2_shared<<<(N+255)/256, 256>>>(N,  -3.0, 3.0,  d_dresult); /*why this one does not work? */





  cudaMemcpy(dresult, d_dresult, N*sizeof(double), cudaMemcpyDeviceToHost); 

  double maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(dresult[i]-20.03574985));
  printf("Max error: %f in %i quadratures \n", maxError, N);
  printf("integral: %f  \n" ,dresult[0]);



  cudaFree(dresult);
  cudaFree(d_droot);
  cudaFree(d_dweight);

}


and a makefile to compile it:

objects = main.o 

all: $(objects)
        nvcc   -Xcompiler -std=c99 -arch=sm_20 $(objects) -o gauss

%.o: %.cpp
        nvcc -x cu -arch=sm_20  -I. -dc $< -o $@

clean:
        rm -f *.o gauss

Thanks in advance for any advice.

Answer

Your handling of d_droot and d_dweight has a variety of errors. When I compile your code, I get various warnings like this:

t640.cu(86): warning: address of a __shared__ variable "d_droot" cannot be directly taken in a host function

t640.cu(87): warning: address of a __shared__ variable "d_dweight" cannot be directly taken in a host function

t640.cu(108): warning: a __shared__ variable "d_droot" cannot be directly read in a host function

t640.cu(109): warning: a __shared__ variable "d_dweight" cannot be directly read in a host function

These should not be ignored.

  1. These declarations:

__device__  double *d_droot, *d_dweight;

do not define __shared__ variables, so these lines:

extern __shared__ double *d_droot;
extern __shared__ double *d_dweight;

make no sense. Furthermore, if you did want these to be dynamically allocated shared variables (what extern __shared__ is used for), you would need to pass the allocation size as the 3rd kernel launch parameter, which you are not doing.
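For reference, a dynamically allocated shared-memory version could look roughly like the sketch below (the kernel name, the explicit global-memory source pointers, and the hard-coded node count of 5 are illustrative assumptions, not part of the original code). Each block cooperatively copies the nodes from global memory into a single extern __shared__ array and partitions that array by hand:

__global__
void lege_inte2_dynshared(int n, double alpha, double a, double b,
                          const double *groot, const double *gweight, double *result)
{
  /* one unsized dynamic shared array per block; partition it manually */
  extern __shared__ double s_nodes[];
  double *s_root   = s_nodes;        /* first 5 doubles */
  double *s_weight = s_nodes + 5;    /* next 5 doubles  */

  /* cooperative global -> shared copy, once per block */
  if (threadIdx.x < 5)
    {
      s_root[threadIdx.x]   = groot[threadIdx.x];
      s_weight[threadIdx.x] = gweight[threadIdx.x];
    }
  __syncthreads();

  double c1 = (b - a) / 2, c2 = (b + a) / 2;
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    {
      result[i] = 0.0;
      for (int dummy = 0; dummy < 5; dummy++)
        result[i] += s_weight[dummy] * f(alpha, c1 * s_root[dummy] + c2)*c1;
    }
}

The launch then supplies the shared-memory byte count as the 3rd launch configuration parameter, e.g. lege_inte2_dynshared<<<(N+255)/256, 256, 2*5*sizeof(double)>>>(N, 1.0, -3.0, 3.0, d_roots_glob, d_weights_glob, d_dresult); where d_roots_glob and d_weights_glob are placeholder names for ordinary cudaMalloc'ed device arrays holding the nodes.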

2. These statements are incorrect:

cudaMalloc(&d_droot, N_nodes*sizeof(double));
cudaMalloc(&d_dweight, N_nodes*sizeof(double));

You cannot take the address of a __device__ variable in host code, and we don't use cudaMalloc to allocate a __device__ variable anyway; it is a static allocation by definition.
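As a brief sketch of the correct pattern (the variable name here is illustrative): a __device__ variable lives at file scope, is sized at compile time, and is initialized from the host by symbol rather than through a pointer obtained with cudaMalloc:

/* statically allocated in device global memory; no cudaMalloc involved */
__device__ double d_nodes_static[5];

/* host side: copy by symbol, not by a host-visible address */
/* cudaMemcpyToSymbol(d_nodes_static, droot, 5*sizeof(double)); */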

3. I recommend doing proper cuda error checking. As a quick test, you can also run your code with cuda-memcheck. Either method would indicate the presence of a runtime error in your code (albeit not the crux of any issue).
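One widely used pattern for quick error checking is a macro around cudaGetLastError (a sketch; the macro name is arbitrary):

#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
              msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      exit(1); \
    } \
  } while (0)

/* usage, after each API call or kernel launch:
   cudaMemcpy(dresult, d_dresult, N*sizeof(double), cudaMemcpyDeviceToHost);
   cudaCheckErrors("memcpy fail");
   lege_inte2<<<(N+255)/256, 256>>>(N, 1.0, -3.0, 3.0, d_dresult);
   cudaCheckErrors("kernel launch fail"); */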

4. These statements are also incorrect:

cudaMemcpy(d_droot, droot, N_nodes*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_dweight, dweight, N_nodes*sizeof(double), cudaMemcpyHostToDevice);

cudaMemcpy is not the correct API to use with a __device__ variable. Use cudaMemcpyToSymbol instead.

The following code has these various usage errors fixed, will compile cleanly, and seems to run correctly. It demonstrates that it is not necessary to pass a __device__ variable as a kernel parameter:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>


__device__  double *d_droot, *d_dweight;


__device__ __host__
double f(double alpha,double x)
{
  /*function to be integrated via gauss-legendre quadrature. */
  return exp(alpha*x);
}

__global__
void lege_inte2(int n, double alpha, double a, double b, double *result)
{
  /*
    Parameters:
    n: Total number of quadratures
    a: Lower integration limit
    b: Upper integration limit
    d_droot[]: roots for the quadrature (file-scope __device__ pointer)
    d_dweight[]: weights for the quadrature (file-scope __device__ pointer)
    result[]: holds the results of the N quadratures.
   */
  double c1 = (b - a) / 2, c2 = (b + a) / 2, sum = 0;
  int dummy;

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    {
      result[i] = 0.0;
      for (dummy = 0; dummy < 5; dummy++)
    result[i] += d_dweight[dummy] * f(alpha,c1 * d_droot[dummy] + c2)*c1;
    }
}

__global__
void lege_inte2_shared(int n,double alpha, double a, double b,  double *result)
{
  /*
    Parameters:
    n: Total number of quadratures
    a: Lower integration limit
    b: Upper integration limit
    d_droot[]: roots for the quadrature (file-scope __device__ pointer)
    d_dweight[]: weights for the quadrature (file-scope __device__ pointer)
    result[]: holds the results of the N quadratures.
   */
  double c1 = (b - a) / 2, c2 = (b + a) / 2, sum = 0;
  int dummy;

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    {
      result[i] = 0.0;
      for (dummy = 0; dummy < 5; dummy++)
    {
      result[i] += d_dweight[dummy] * f(alpha,c1 * d_droot[dummy] + c2)*c1;
      printf(" Vale: %f \n", d_dweight[dummy]);
    }
    }
}


int main(void)
{
  int N = 1<<23;
  int N_nodes = 5;


  double *droot, *dweight, *dresult, *d_dresult, *d_droot_temp, *d_dweight_temp;


  /*double version in host*/
  droot =(double*)malloc(N_nodes*sizeof(double));
  dweight =(double*)malloc(N_nodes*sizeof(double));
  dresult =(double*)malloc(N*sizeof(double)); /*will receive the results of N quadratures!*/


  /*double version in device*/
  cudaMalloc(&d_droot_temp, N_nodes*sizeof(double));
  cudaMalloc(&d_dweight_temp, N_nodes*sizeof(double));
  cudaMalloc(&d_dresult, N*sizeof(double)); /*results for N quadratures will be contained here*/


  /*double version of the roots and weights*/
  droot[0] = 0.90618;
  droot[1] = 0.538469;
  droot[2] = 0.0;
  droot[3] = -0.538469;
  droot[4] = -0.90618;


  dweight[0] = 0.236927;
  dweight[1] = 0.478629;
  dweight[2] = 0.568889;
  dweight[3] = 0.478629;
  dweight[4] = 0.236927;



  /*double copy host-> device*/
  cudaMemcpy(d_droot_temp, droot, N_nodes*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_dweight_temp, dweight, N_nodes*sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpyToSymbol(d_droot, &d_droot_temp, sizeof(double *));
  cudaMemcpyToSymbol(d_dweight, &d_dweight_temp, sizeof(double *));

  // Launch one quadrature per thread

  lege_inte2<<<(N+255)/256, 256>>>(N,1.0,  -3.0, 3.0, d_dresult); /*This kernel works OK*/
  //lege_inte2_shared<<<(N+255)/256, 256>>>(N,  -3.0, 3.0,  d_dresult); /*why this one does not work? */





  cudaMemcpy(dresult, d_dresult, N*sizeof(double), cudaMemcpyDeviceToHost);

  double maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(dresult[i]-20.03574985));
  printf("Max error: %f in %i quadratures \n", maxError, N);
  printf("integral: %f  \n" ,dresult[0]);



  cudaFree(d_dresult);
  cudaFree(d_droot_temp);
  cudaFree(d_dweight_temp);

}

(I can't vouch for the numerical results.)

Now, regarding this question:

Even more, aiming to have available more free memory on the device, is it possible to burn the roots and weights to some constant portion of the memory of the device?

Since your accesses of d_dweight and d_droot appear to be uniform:

result[i] += d_dweight[dummy] * f(alpha,c1 * d_droot[dummy] + c2)*c1;

Then it may be useful to define these as __constant__ memory space variables. Constant memory access is optimum when every thread in a warp is requesting the same value (same location) in constant memory. However, __constant__ memory cannot be allocated dynamically, and it makes no sense to store a pointer (only) in constant memory; this doesn't provide any of the benefits of the constant cache mechanism.

Therefore, the following further modification to your code demonstrates how to store these values in __constant__ memory, but it requires a static allocation. Furthermore, this doesn't really "save" any device memory. Whether you allocate dynamically using cudaMalloc, statically with a __device__ variable, or via a __constant__ variable definition (also a static allocation), all of these methods require global memory backing store in device memory (on-board DRAM).

Code demonstrating possible constant memory usage:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>

#define N_nodes 5

__constant__   double d_droot[N_nodes], d_dweight[N_nodes];


__device__ __host__
double f(double alpha,double x)
{
  /*function to be integrated via gauss-legendre quadrature. */
  return exp(alpha*x);
}

__global__
void lege_inte2(int n, double alpha, double a, double b, double *result)
{
  /*
    Parameters:
    n: Total number of quadratures
    a: Lower integration limit
    b: Upper integration limit
    d_droot[]: roots for the quadrature (__constant__ array)
    d_dweight[]: weights for the quadrature (__constant__ array)
    result[]: holds the results of the N quadratures.
   */
  double c1 = (b - a) / 2, c2 = (b + a) / 2, sum = 0;
  int dummy;

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    {
      result[i] = 0.0;
      for (dummy = 0; dummy < 5; dummy++)
    result[i] += d_dweight[dummy] * f(alpha,c1 * d_droot[dummy] + c2)*c1;
    }
}

__global__
void lege_inte2_shared(int n,double alpha, double a, double b,  double *result)
{
  /*
    Parameters:
    n: Total number of quadratures
    a: Lower integration limit
    b: Upper integration limit
    d_droot[]: roots for the quadrature (__constant__ array)
    d_dweight[]: weights for the quadrature (__constant__ array)
    result[]: holds the results of the N quadratures.
   */
  double c1 = (b - a) / 2, c2 = (b + a) / 2, sum = 0;
  int dummy;

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    {
      result[i] = 0.0;
      for (dummy = 0; dummy < 5; dummy++)
    {
      result[i] += d_dweight[dummy] * f(alpha,c1 * d_droot[dummy] + c2)*c1;
      printf(" Vale: %f \n", d_dweight[dummy]);
    }
    }
}


int main(void)
{
  int N = 1<<23;
 // int N_nodes = 5;


  double *droot, *dweight, *dresult, *d_dresult;


  /*double version in host*/
  droot =(double*)malloc(N_nodes*sizeof(double));
  dweight =(double*)malloc(N_nodes*sizeof(double));
  dresult =(double*)malloc(N*sizeof(double)); /*will receive the results of N quadratures!*/


  /*double version in device*/
  cudaMalloc(&d_dresult, N*sizeof(double)); /*results for N quadratures will be contained here*/


  /*double version of the roots and weights*/
  droot[0] = 0.90618;
  droot[1] = 0.538469;
  droot[2] = 0.0;
  droot[3] = -0.538469;
  droot[4] = -0.90618;


  dweight[0] = 0.236927;
  dweight[1] = 0.478629;
  dweight[2] = 0.568889;
  dweight[3] = 0.478629;
  dweight[4] = 0.236927;



  /*double copy host-> device*/
  cudaMemcpyToSymbol(d_droot, droot, N_nodes*sizeof(double));
  cudaMemcpyToSymbol(d_dweight, dweight, N_nodes*sizeof(double));

  // Launch one quadrature per thread

  lege_inte2<<<(N+255)/256, 256>>>(N,1.0,  -3.0, 3.0, d_dresult); /*This kernel works OK*/
  //lege_inte2_shared<<<(N+255)/256, 256>>>(N,  -3.0, 3.0,  d_dresult); /*why this one does not work? */





  cudaMemcpy(dresult, d_dresult, N*sizeof(double), cudaMemcpyDeviceToHost);

  double maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(dresult[i]-20.03574985));
  printf("Max error: %f in %i quadratures \n", maxError, N);
  printf("integral: %f  \n" ,dresult[0]);



  cudaFree(d_dresult);

}
