Recipe to copy 1D strided data with cudaMemcpy2D
Problem description
If one has two contiguous ranges of device memory, it is possible to copy memory from one to the other using cudaMemcpy.
double* source = ...
double* dest = ...
cudaMemcpy(dest, source, N * sizeof(double), cudaMemcpyDeviceToDevice); // the count is given in bytes
Now suppose that I want to copy source into dest, but reading only every 2nd element of source and writing only every 3rd element of dest.
That is, dest[0] = source[0], dest[3] = source[2], dest[6] = source[4], ...
Of course a single plain cudaMemcpy cannot do this.
Intuitively, cudaMemcpy2D should be able to do the job, because strided elements can be seen as a column in a larger array.
But cudaMemcpy2D has many input parameters that are hard to interpret in this context, such as pitch.
For example, I managed to use cudaMemcpy2D to reproduce the case where both strides are 1.
cudaMemcpy2D(dest, 1, source, 1, 1, n*sizeof(T), cudaMemcpyDeviceToHost);
But I cannot figure out the general case, where dest_stride and source_stride differ from 1.
Is there a way to copy strided data to strided data with cudaMemcpy2D?
In which order do I have to pass the known information about the layout, namely the two strides and sizeof(T)?
cudaMemcpy2D(dest, ??, source, ???, ????, ????, cudaMemcpyDeviceToHost);
Recommended answer
A generic function for such a strided copy could look roughly like this:
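#include <cuda_runtime.h>

cudaError_t cudaMemcpyStrided(
    void *dst, size_t dstStride,
    const void *src, size_t srcStride,
    size_t numElements, size_t elementSize, cudaMemcpyKind kind) {
    // Treat every copied element as one row of a 2D array. The pitch is the
    // distance in bytes between two consecutive elements that are copied.
    size_t srcPitchInBytes = srcStride * elementSize;
    size_t dstPitchInBytes = dstStride * elementSize;
    size_t width = 1 * elementSize;  // bytes per row: exactly one element
    size_t height = numElements;     // number of rows: one per copied element
    return cudaMemcpy2D(
        dst, dstPitchInBytes,
        src, srcPitchInBytes,
        width, height,
        kind);
}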
For your example, it could be called as
cudaMemcpyStrided(dest, 3, source, 2, 3, sizeof(double), cudaMemcpyDeviceToDevice);
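To make the role of each parameter easier to see, here is a minimal host-only sketch of the copy pattern. It does not call the CUDA runtime: `memcpy2DHostModel`, `stridedCopyHostModel` and `questionExample` are hypothetical names introduced only for this illustration, and the model assumes the documented behaviour that `height` rows of `width` bytes are copied, with the row starts `dpitch` and `spitch` bytes apart.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only stand-in for the copy pattern of cudaMemcpy2D (hypothetical helper,
// not part of the CUDA runtime): copies `height` rows of `width` bytes each,
// where row r starts at dst + r * dpitch and src + r * spitch.
void memcpy2DHostModel(void* dst, std::size_t dpitch,
                       const void* src, std::size_t spitch,
                       std::size_t width, std::size_t height) {
    // A row cannot be wider than the distance between two row starts.
    assert(width <= dpitch && width <= spitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dpitch, s + row * spitch, width);
    }
}

// Strided copy expressed through the model: every element is its own "row".
void stridedCopyHostModel(double* dst, std::size_t dstStride,
                          const double* src, std::size_t srcStride,
                          std::size_t numElements) {
    memcpy2DHostModel(dst, dstStride * sizeof(double),   // dpitch in bytes
                      src, srcStride * sizeof(double),   // spitch in bytes
                      sizeof(double),                    // width: one element
                      numElements);                      // height: element count
}

// The layout from the question: dest stride 3, source stride 2, 3 elements.
std::vector<double> questionExample() {
    std::vector<double> source = {0, 1, 2, 3, 4, 5};
    std::vector<double> dest(9, 0.0);
    stridedCopyHostModel(dest.data(), 3, source.data(), 2, 3);
    return dest;
}
```

Calling `questionExample()` returns a 9-element array in which only indices 0, 3 and 6 were written, taking their values from source indices 0, 2 and 4. Since one element is one row, the stride only ever appears in the pitch, never in the width.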
"Roughly", because I just translated it on the fly from the (Java/JCuda based) code that I tested it with:
import static jcuda.runtime.JCuda.cudaMemcpy2D;

import java.util.Arrays;
import java.util.Locale;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.cudaMemcpyKind;

public class JCudaStridedMemcopy {
    public static void main(String[] args) {
        // Case 1: the layout from the question (dest stride 3, source stride 2)
        int dstLength = 9;
        int srcLength = 6;
        int dstStride = 3;
        int srcStride = 2;
        int numElements = 3;
        runExample(dstLength, dstStride, srcLength, srcStride, numElements);

        // Case 2: a larger source stride
        dstLength = 9;
        srcLength = 12;
        dstStride = 3;
        srcStride = 4;
        numElements = 3;
        runExample(dstLength, dstStride, srcLength, srcStride, numElements);

        // Case 3: the strides of case 1, with more elements
        dstLength = 18;
        srcLength = 12;
        dstStride = 3;
        srcStride = 2;
        numElements = 6;
        runExample(dstLength, dstStride, srcLength, srcStride, numElements);
    }

    private static void runExample(int dstLength, int dstStride, int srcLength, int srcStride, int numElements) {
        double dst[] = new double[dstLength];
        double src[] = new double[srcLength];
        // Fill the source with 0, 1, 2, ... so copied values reveal their source index
        for (int i = 0; i < src.length; i++) {
            src[i] = i;
        }
        cudaMemcpyStrided(dst, dstStride, src, srcStride, numElements);

        System.out.println("Copy " + numElements + " elements");
        System.out.println(" to array with length " + dstLength + ", with a stride of " + dstStride);
        System.out.println(" from array with length " + srcLength + ", with a stride of " + srcStride);
        System.out.println("");
        System.out.println("Destination:");
        System.out.println(toString2D(dst, dstStride));
        System.out.println("Flat: " + Arrays.toString(dst));
        System.out.println("");
        System.out.println("Source:");
        System.out.println(toString2D(src, srcStride));
        System.out.println("Flat: " + Arrays.toString(src));
        System.out.println("");
        System.out.println("Done");
        System.out.println("");
    }

    private static void cudaMemcpyStrided(double dst[], int dstStride, double src[], int srcStride, int numElements) {
        // Each element is one row: the pitch is the stride in bytes, the width is one element
        long srcPitchInBytes = srcStride * Sizeof.DOUBLE;
        long dstPitchInBytes = dstStride * Sizeof.DOUBLE;
        long width = 1 * Sizeof.DOUBLE;
        long height = numElements;
        cudaMemcpy2D(
            Pointer.to(dst), dstPitchInBytes,
            Pointer.to(src), srcPitchInBytes,
            width, height,
            cudaMemcpyKind.cudaMemcpyHostToHost);
    }

    // Formats the flat array as rows of `columns` values each
    public static String toString2D(double[] a, long columns) {
        String format = "%4.1f ";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < a.length; i++) {
            if (i > 0 && i % columns == 0) {
                sb.append("\n");
            }
            sb.append(String.format(Locale.ENGLISH, format, a[i]));
        }
        return sb.toString();
    }
}
To give an idea of what the function does, based on the examples/test cases, here is the output:
Copy 3 elements
to array with length 9, with a stride of 3
from array with length 6, with a stride of 2
Destination:
0.0 0.0 0.0
2.0 0.0 0.0
4.0 0.0 0.0
Flat: [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 4.0, 0.0, 0.0]
Source:
0.0 1.0
2.0 3.0
4.0 5.0
Flat: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
Done
Copy 3 elements
to array with length 9, with a stride of 3
from array with length 12, with a stride of 4
Destination:
0.0 0.0 0.0
4.0 0.0 0.0
8.0 0.0 0.0
Flat: [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 8.0, 0.0, 0.0]
Source:
0.0 1.0 2.0 3.0
4.0 5.0 6.0 7.0
8.0 9.0 10.0 11.0
Flat: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
Done
Copy 6 elements
to array with length 18, with a stride of 3
from array with length 12, with a stride of 2
Destination:
0.0 0.0 0.0
2.0 0.0 0.0
4.0 0.0 0.0
6.0 0.0 0.0
8.0 0.0 0.0
10.0 0.0 0.0
Flat: [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 4.0, 0.0, 0.0, 6.0, 0.0, 0.0, 8.0, 0.0, 0.0, 10.0, 0.0, 0.0]
Source:
0.0 1.0
2.0 3.0
4.0 5.0
6.0 7.0
8.0 9.0
10.0 11.0
Flat: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
Done
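A related question is how long the arrays have to be for a given stride and element count. The following is only a sketch under the assumption that the copy touches exactly one element per row, so the last index touched is (numElements - 1) * stride. `minLengthForStridedCopy` is a hypothetical helper written for this illustration, not a CUDA function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest array length (in elements) that contains every
// index a strided copy touches, assuming indices 0, stride, 2 * stride, ...
std::size_t minLengthForStridedCopy(std::size_t stride, std::size_t numElements) {
    assert(stride >= 1);           // a stride of 0 would make all rows overlap
    if (numElements == 0) {
        return 0;                  // nothing is copied, nothing is touched
    }
    return (numElements - 1) * stride + 1;
}
```

For the three test cases above this gives 7, 7 and 16 elements for the destination (lengths used: 9, 9, 18) and 5, 9 and 11 for the source (lengths used: 6, 12, 12), so each array in the tests is large enough under this assumption.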