Efficient way to copy strided data (to and from a CUDA device)?

Published: 2016/8/21 21:53:47 · Tags: c++, c, matrix, cuda, memcpy


Question

Is there a way to efficiently copy data strided by a constant (or even non-constant) value to and from the CUDA device?

I want to diagonalize a large symmetric matrix.

The Jacobi algorithm performs a bunch of operations on two rows and two columns within each iteration.

Since the matrix itself is too big to be copied to the device entirely, I am looking for a way to copy just those two rows and two columns to the device.

It would be nice to store the data in triangular matrix form, but then additional downsides arise: a non-constant row length (not that much of a problem) and a non-constant stride between the values of a column (the stride increases by 1 with each row).
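To make the growing stride concrete, here is a minimal host-side sketch. It assumes a row-major packed lower triangle in which row i stores i + 1 values (columns 0..i); the question does not spell out the exact layout, so this is an assumption. The names packed_index and gather_column are hypothetical helpers, not part of any CUDA API. The idea is to gather one column of the symmetric matrix into a contiguous staging buffer on the host, which could then be sent to the device with a single contiguous cudaMemcpy instead of one copy per element. Whether that beats the alternatives has not been measured here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (row, col), with col <= row, in a row-major packed
// lower triangle. Assumed layout: row i stores i + 1 values (columns 0..i).
inline std::size_t packed_index(std::size_t row, std::size_t col)
{
    assert(col <= row);
    return row * (row + 1) / 2 + col;
}

// Gather the full column `col` of the symmetric n x n matrix into a
// contiguous buffer (hypothetical helper, host side only).
std::vector<double> gather_column(const std::vector<double>& packed,
                                  std::size_t n, std::size_t col)
{
    assert(col < n);
    assert(packed.size() == n * (n + 1) / 2);
    std::vector<double> out(n);

    // Rows 0..col: by symmetry A(i, col) == A(col, i), and those values
    // sit contiguously in packed row `col`.
    for (std::size_t i = 0; i <= col; ++i)
        out[i] = packed[packed_index(col, i)];

    // Rows col+1..n-1: walk down the column; the step from row i-1 to
    // row i is i elements, so the stride grows by one per row.
    std::size_t offset = packed_index(col, col);
    for (std::size_t i = col + 1; i < n; ++i) {
        offset += i;
        out[i] = packed[offset];
    }
    return out;
}
```

For a 4 x 4 matrix stored as {0, 10, 11, 20, 21, 22, 30, 31, 32, 33}, gather_column(packed, 4, 1) visits offsets 1, 2, 4 and 7, so the distance between consecutive column entries below the diagonal is 2 and then 3.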

I looked at some timings and found that copying strided values one by one is very slow (synchronous as well as asynchronous).

Edit: removed the solution from the question and added it as an answer.

Recommended answer

Thanks to Robert Crovella for giving the right hint: use cudaMemcpy2D. I'll append my test code so that everyone can follow along.

If anyone comes up with suggestions for solving the copy problem for row-major-ordered triangular matrices, feel free to write another answer.

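#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

// launched with one thread per block: block i writes `value` to arr[i]
__global__ void setValues (double *arr, double value)
{
  arr[blockIdx.x] = value;
}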

int main( void ) 
{
  // define consts
  static size_t const R = 10, C = 10, RC = R*C;

  // create matrices and initialize
  double * matrix = (double*) malloc(RC*sizeof(double)), 
    *final_matrix = (double*) malloc(RC*sizeof(double));
  for (size_t i=0; i<RC; ++i) matrix[i] = rand()%R+10;
  memcpy(final_matrix, matrix, RC*sizeof(double));

  // create vectors on the device
  double *dev_col, *dev_row, 
    *h_row = (double*) malloc(C*sizeof(double)), 
    *h_col = (double*) malloc(R*sizeof(double));
  cudaMalloc((void**)&dev_row, C * sizeof(double));
  cudaMalloc((void**)&dev_col, R * sizeof(double));

  // choose row / col to copy
  size_t selected_row = 7, selected_col = 3;

  // since we are in row-major order we can copy the row at once 
  cudaMemcpy(dev_row, &matrix[selected_row*C], 
    C * sizeof(double), cudaMemcpyHostToDevice);
  // the column needs to be copied using cudaMemcpy2D:
  // R "rows" of one double each, read with C*sizeof(double) as source pitch
  cudaMemcpy2D(dev_col, sizeof(double), &matrix[selected_col], 
    C*sizeof(double), sizeof(double), R, cudaMemcpyHostToDevice);

  // copy back to host to check whether we got the right column and row
  cudaMemcpy(h_row, dev_row, C * sizeof(double), cudaMemcpyDeviceToHost);
  cudaMemcpy(h_col, dev_col, R * sizeof(double), cudaMemcpyDeviceToHost);
  // change values to evaluate backcopy
  setValues<<<R, 1>>>(dev_col, 88.0); // column should be 88
  setValues<<<C, 1>>>(dev_row, 99.0); // row should be 99
  // backcopy
  cudaMemcpy(&final_matrix[selected_row*C], dev_row, 
    C * sizeof(double), cudaMemcpyDeviceToHost);
  cudaMemcpy2D(&final_matrix[selected_col], C*sizeof(double), dev_col, 
    sizeof(double), sizeof(double), R, cudaMemcpyDeviceToHost);

  cudaDeviceSynchronize();
  // output for checking functionality

  printf("Initial Matrix:\n");
  for (size_t i=0; i<R; ++i)
  {
    for (size_t j=0; j<C; ++j) printf(" %lf", matrix[i*C+j]);
    printf("\n");
  }
  printf("\nRow %zu values: ", selected_row); // %zu is the size_t format
  for (size_t i=0; i<C; ++i) printf(" %lf", h_row[i]);
  printf("\nCol %zu values: ", selected_col);
  for (size_t i=0; i<R; ++i) printf(" %lf", h_col[i]);
  printf("\n\n");

  printf("Final Matrix:\n");
  for (size_t i=0; i<R; ++i)
  {
    for (size_t j=0; j<C; ++j) printf(" %lf", final_matrix[i*C+j]);
    printf("\n");
  }

  cudaFree(dev_col);
  cudaFree(dev_row);
  free(matrix);
  free(final_matrix);
  free(h_row);
  free(h_col);
  cudaDeviceReset();
  return 0;

}
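To see why the cudaMemcpy2D calls above pick out a column, it can help to emulate the addressing on the host. The sketch below is not the CUDA implementation; copy_2d is a hypothetical helper that only mirrors the argument order (dst, dpitch, src, spitch, width, height) and the documented meaning of those arguments: height rows of width bytes are copied, and each pitch is the distance in bytes between the starts of consecutive rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical host-only stand-in that mirrors the argument order of
// cudaMemcpy2D (without the copy-kind argument). All sizes are in bytes,
// except `height`, which counts rows.
void copy_2d(void* dst, std::size_t dpitch,
             const void* src, std::size_t spitch,
             std::size_t width, std::size_t height)
{
    // A pitch smaller than the row width would make rows overlap.
    assert(dpitch >= width && spitch >= width);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t r = 0; r < height; ++r)
        std::memcpy(d + r * dpitch, s + r * spitch, width);
}
```

With a row-major R x C matrix of doubles, the call copy_2d(col, sizeof(double), &matrix[selected_col], C * sizeof(double), sizeof(double), R) reads one double per matrix row, C doubles apart, and packs them densely into col. Swapping the two pitches, as in the back copy above, scatters the dense column back into the matrix.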
