Efficient way to copy strided data (to and from a CUDA device)?
Problem description
Is it possible to copy data strided by a constant (or even non-constant) value to and from a CUDA device efficiently?
I want to diagonalize a large symmetric matrix.
The Jacobi algorithm performs a number of operations on two rows and two columns in each iteration.
Since the matrix is too big to be copied to the device in its entirety, I am looking for a way to copy just those two rows and two columns to the device.
It would be nice to store the data in triangular matrix form, but then additional downsides arise:
- non-constant row length [not much of a problem]
- non-constant stride between the column values [the stride increases by 1 with each row]
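The second downside can be made concrete with a short derivation. It assumes the lower triangle is packed row by row, so that row $i$ holds $i+1$ entries; the post does not spell out the exact layout, and the symbol $\operatorname{idx}$ is introduced here only for illustration.

```latex
\[
\operatorname{idx}(i,j) = \frac{i(i+1)}{2} + j, \qquad 0 \le j \le i,
\]
\[
\operatorname{idx}(i+1,j) - \operatorname{idx}(i,j)
  = \frac{(i+1)(i+2)}{2} - \frac{i(i+1)}{2}
  = i + 1 .
\]
```

Under that assumption the gap between two consecutive entries of a column is $i+1$, so it grows by one with every row and no single pitch value describes a column.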
I looked at some timings and found that copying strided values one by one is very slow (synchronously as well as asynchronously).
// edit: removed solution - added an answer
Recommended answer
Thanks to Robert Crovella for the hint to use cudaMemcpy2D. I'll append my test code so that everyone can follow along.
If anyone comes up with suggestions for solving the copy problem with row-major-ordered triangular matrices, please feel free to write another answer.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Each block writes one element, so launch one block per array entry.
__global__ void setValues (double *arr, double value)
{
    arr[blockIdx.x] = value;
}

int main( void )
{
    // define consts
    static size_t const R = 10, C = 10, RC = R*C;

    // create matrices and initialize
    double * matrix = (double*) malloc(RC*sizeof(double)),
           *final_matrix = (double*) malloc(RC*sizeof(double));
    for (size_t i=0; i<RC; ++i) matrix[i] = rand()%R+10;
    memcpy(final_matrix, matrix, RC*sizeof(double));

    // create vectors on the host and on the device
    double *dev_col, *dev_row,
           *h_row = (double*) malloc(C*sizeof(double)),
           *h_col = (double*) malloc(R*sizeof(double));
    cudaMalloc((void**)&dev_row, C * sizeof(double));
    cudaMalloc((void**)&dev_col, R * sizeof(double));

    // choose row / col to copy
    size_t selected_row = 7, selected_col = 3;

    // since we are in row-major order we can copy the row at once
    cudaMemcpy(dev_row, &matrix[selected_row*C],
               C * sizeof(double), cudaMemcpyHostToDevice);

    // the column needs to be copied using cudaMemcpy2D:
    // R "rows" of one double each, with C*sizeof(double) as source pitch
    // and sizeof(double) as destination pitch (densely packed on the device)
    cudaMemcpy2D(dev_col, sizeof(double), &matrix[selected_col],
                 C*sizeof(double), sizeof(double), R, cudaMemcpyHostToDevice);

    // copy back to host to check whether we got the right column and row
    cudaMemcpy(h_row, dev_row, C * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_col, dev_col, R * sizeof(double), cudaMemcpyDeviceToHost);

    // change values to evaluate backcopy
    setValues<<<R, 1>>>(dev_col, 88.0); // column should be 88
    setValues<<<C, 1>>>(dev_row, 99.0); // row should be 99

    // backcopy: the pitches are swapped compared to the upload
    cudaMemcpy(&final_matrix[selected_row*C], dev_row,
               C * sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy2D(&final_matrix[selected_col], C*sizeof(double), dev_col,
                 sizeof(double), sizeof(double), R, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    // output for checking functionality
    printf("Initial Matrix:\n");
    for (size_t i=0; i<R; ++i)
    {
        for (size_t j=0; j<C; ++j) printf(" %lf", matrix[i*C+j]);
        printf("\n");
    }
    // size_t must be printed with %zu, not %u
    printf("\nRow %zu values: ", selected_row);
    for (size_t i=0; i<C; ++i) printf(" %lf", h_row[i]);
    printf("\nCol %zu values: ", selected_col);
    for (size_t i=0; i<R; ++i) printf(" %lf", h_col[i]);
    printf("\n\n");
    printf("Final Matrix:\n");
    for (size_t i=0; i<R; ++i)
    {
        for (size_t j=0; j<C; ++j) printf(" %lf", final_matrix[i*C+j]);
        printf("\n");
    }

    // release device and host memory
    cudaFree(dev_col);
    cudaFree(dev_row);
    free(matrix);
    free(final_matrix);
    free(h_row);
    free(h_col);
    cudaDeviceReset();
    return 0;
}
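For the row-major packed triangular case that is still open above, one possible direction is to gather the needed line into a contiguous host buffer first and then transfer that buffer in a single copy. The following is only a host-side sketch under stated assumptions, not a tested solution from the original post: it assumes the lower triangle of the symmetric matrix is packed row by row, and the names `packedIndex`, `gatherSymmetricLine` and `scatterSymmetricLine` are hypothetical helpers introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (i, j) with j <= i in a row-major packed lower triangle.
// Hypothetical helper; assumes row i stores exactly i + 1 values.
inline std::size_t packedIndex(std::size_t i, std::size_t j)
{
    return i * (i + 1) / 2 + j;
}

// Gather row/column k of a symmetric n x n matrix (both are identical by
// symmetry) from packed storage into a contiguous buffer of n values.
std::vector<double> gatherSymmetricLine(const std::vector<double>& packed,
                                        std::size_t n, std::size_t k)
{
    assert(k < n);
    assert(packed.size() == n * (n + 1) / 2);
    std::vector<double> line(n);
    // Entries up to and including the diagonal are contiguous in packed row k.
    for (std::size_t j = 0; j <= k; ++j)
        line[j] = packed[packedIndex(k, j)];
    // Entries below the diagonal lie in column k, with a stride that grows per row.
    for (std::size_t i = k + 1; i < n; ++i)
        line[i] = packed[packedIndex(i, k)];
    return line;
}

// Write a contiguous line of n values back to row/column k of the packed matrix.
void scatterSymmetricLine(std::vector<double>& packed,
                          const std::vector<double>& line, std::size_t k)
{
    const std::size_t n = line.size();
    assert(k < n);
    assert(packed.size() == n * (n + 1) / 2);
    for (std::size_t j = 0; j <= k; ++j)
        packed[packedIndex(k, j)] = line[j];
    for (std::size_t i = k + 1; i < n; ++i)
        packed[packedIndex(i, k)] = line[i];
}
```

The gathered `line.data()` buffer could then be sent with one `cudaMemcpy` of `n * sizeof(double)` bytes, and the result written back with `scatterSymmetricLine` after the download. Whether the extra host-side pass pays off compared to many small copies would have to be measured for the matrix sizes in question.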