Efficient way to copy strided data (to and from a CUDA device)?

Question

Is it possible to efficiently copy data that is strided by a constant (or even non-constant) value to and from a CUDA device?

I want to diagonalize a large symmetric matrix.

The Jacobi algorithm performs a number of operations on two rows and two columns in each iteration.

Since the matrix itself is too big to be copied to the device in its entirety, I am looking for a way to copy just those two rows and two columns to the device.

It would be nice to store the data in triangular matrix form, but that brings additional downsides:

  • non-constant row length (not much of a problem)

  • non-constant stride of the column values (the stride increases by 1 for each row)
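To make the second downside concrete, here is a minimal sketch, assuming the lower triangle is packed row by row (row i holds i+1 entries). The names `tri_index` and `gather_column` are hypothetical helpers, not part of CUDA or of the original post. The code is plain host C++ that also compiles inside a .cu file. It shows why the column stride grows by one per row, and one possible workaround: gathering a column into a contiguous staging buffer on the host, which could then be sent to the device with a single cudaMemcpy.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: offset of element (i, j), with j <= i, in a row-major
// packed lower-triangular array. Row i starts after 1 + 2 + ... + i entries,
// i.e. at offset i*(i+1)/2.
std::size_t tri_index(std::size_t i, std::size_t j) {
  return i * (i + 1) / 2 + j;
}

// Hypothetical helper: gather the full column j of an n x n symmetric matrix
// from its packed lower triangle into a contiguous buffer. Entries above the
// diagonal are read through symmetry, A(i, j) == A(j, i).
std::vector<double> gather_column(const std::vector<double>& packed,
                                  std::size_t n, std::size_t j) {
  std::vector<double> col(n);
  for (std::size_t i = 0; i < n; ++i) {
    col[i] = (i >= j) ? packed[tri_index(i, j)]   // below/on diagonal: strided
                      : packed[tri_index(j, i)];  // above diagonal: inside row j
  }
  return col;
}
```

Going down a column from row i to row i+1 moves the offset by i+1 elements, so no single pitch value describes the whole column. The part of the column above the diagonal, however, lies contiguously inside row j.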

I ran some timings and found that copying strided values one by one is very slow (synchronously as well as asynchronously).

Edit: removed the solution from the question and added it as an answer.

Recommended answer

Thanks to Robert Crovella for the right hint to use cudaMemcpy2D. I'll append my test code so that everyone can follow along...

If anyone comes up with suggestions for solving the copy problem with row-major-ordered triangular matrices, please feel free to write another answer.

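#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

// one block per element: block i writes value into arr[i]
__global__ void setValues (double *arr, double value)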
{
  arr[blockIdx.x] = value;
}

int main( void ) 
{
  // define consts
  static size_t const R = 10, C = 10, RC = R*C;

  // create matrices and initialize
  double * matrix = (double*) malloc(RC*sizeof(double)), 
    *final_matrix = (double*) malloc(RC*sizeof(double));
  for (size_t i=0; i<RC; ++i) matrix[i] = rand()%R+10;
  memcpy(final_matrix, matrix, RC*sizeof(double));

  // create vectors on the device
  double *dev_col, *dev_row, 
    *h_row = (double*) malloc(C*sizeof(double)), 
    *h_col = (double*) malloc(R*sizeof(double));
  cudaMalloc((void**)&dev_row, C * sizeof(double));
  cudaMalloc((void**)&dev_col, R * sizeof(double));

  // choose row / col to copy
  size_t selected_row = 7, selected_col = 3;

  // since we are in row-major order we can copy the row at once 
  cudaMemcpy(dev_row, &matrix[selected_row*C], 
    C * sizeof(double), cudaMemcpyHostToDevice);
  // the column needs to be copied using cudaMemcpy2D,
  // with the row length in bytes (C*sizeof(double)) as the source pitch
  cudaMemcpy2D(dev_col, sizeof(double), &matrix[selected_col], 
    C*sizeof(double), sizeof(double), R, cudaMemcpyHostToDevice);

  // copy back to host to check whether we got the right column and row
  cudaMemcpy(h_row, dev_row, C * sizeof(double), cudaMemcpyDeviceToHost);
  cudaMemcpy(h_col, dev_col, R * sizeof(double), cudaMemcpyDeviceToHost);
  // change values to evaluate backcopy
  setValues<<<R, 1>>>(dev_col, 88.0); // column should be 88
  setValues<<<C, 1>>>(dev_row, 99.0); // row should be 99
  // backcopy
  cudaMemcpy(&final_matrix[selected_row*C], dev_row, 
    C * sizeof(double), cudaMemcpyDeviceToHost);
  cudaMemcpy2D(&final_matrix[selected_col], C*sizeof(double), dev_col, 
    sizeof(double), sizeof(double), R, cudaMemcpyDeviceToHost);

  cudaDeviceSynchronize();
  // output for checking functionality

  printf("Initial Matrix:\n");
  for (size_t i=0; i<R; ++i)
  {
    for (size_t j=0; j<C; ++j) printf(" %lf", matrix[i*C+j]);
    printf("\n");
  }
  printf("\nRow %zu values: ", selected_row);
  for (size_t i=0; i<C; ++i) printf(" %lf", h_row[i]);
  printf("\nCol %zu values: ", selected_col);
  for (size_t i=0; i<R; ++i) printf(" %lf", h_col[i]);
  printf("\n\n");

  printf("Final Matrix:\n");
  for (size_t i=0; i<R; ++i)
  {
    for (size_t j=0; j<C; ++j) printf(" %lf", final_matrix[i*C+j]);
    printf("\n");
  }

  cudaFree(dev_col);
  cudaFree(dev_row);
  free(matrix);
  free(final_matrix);
  free(h_row);
  free(h_col);
  cudaDeviceReset();
  return 0;

}
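
The roles of the pitch, width, and height arguments in the cudaMemcpy2D calls above are easy to mix up. As a host-only sketch of the copy pattern (this is not how CUDA implements it, and `copy2d_reference` is a hypothetical name), the function below takes the same first six parameters in the same order as cudaMemcpy2D and does the equivalent work with memcpy:

```cpp
#include <cstddef>
#include <cstring>

// Host-only model of a pitched 2D copy (hypothetical helper, not a CUDA API).
// Copies `height` rows of `width_bytes` bytes each. Consecutive rows start
// `spitch` bytes apart in src and `dpitch` bytes apart in dst.
void copy2d_reference(void* dst, std::size_t dpitch,
                      const void* src, std::size_t spitch,
                      std::size_t width_bytes, std::size_t height) {
  unsigned char* d = static_cast<unsigned char*>(dst);
  const unsigned char* s = static_cast<const unsigned char*>(src);
  for (std::size_t r = 0; r < height; ++r) {
    std::memcpy(d + r * dpitch, s + r * spitch, width_bytes);
  }
}
```

Read this way, a column of a row-major R x C matrix of doubles is a 2D region that is one double wide and R rows high, with a source pitch of C*sizeof(double). Packing it tightly means a destination pitch of sizeof(double), and swapping the two pitches scatters the packed values back into the column, which is what the back-copy in the test code does.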
