PyCUDA precision of matrix multiplication code

Problem description

I am trying to learn CUDA, using PyCUDA to write a simple matrix multiplication code. For two 4x4 randomly generated matrices I get the following results:

Cuda:
[[ -5170.86181641 -21146.49609375  20690.02929688 -35413.9296875 ]
 [-18998.5         -3199.53271484  13364.62890625   7141.36816406]
 [ 31197.43164062  21095.02734375   1750.64453125  11304.63574219]
 [  -896.64978027  18424.33007812 -17135.00390625   7418.28417969]]

Python:
[[ -5170.86035156 -21146.49609375  20690.02929688 -35413.9296875 ]
 [-18998.5         -3199.53271484  13364.62695312   7141.36816406]
 [ 31197.43164062  21095.02929688   1750.64404297  11304.63574219]
 [  -896.64941406  18424.33007812 -17135.00390625   7418.28417969]]

Cuda-Python:
[[-0.00146484  0.          0.          0.        ]
 [ 0.          0.          0.00195312  0.        ]
 [ 0.         -0.00195312  0.00048828  0.        ]
 [-0.00036621  0.          0.          0.        ]]

The error is of the order of 1e-3 and increases as I increase the size of the matrices. I am not sure whether it's a bug or not. My question is whether such a "large" error is a normal thing with float32, or whether I am doing something wrong?
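One way to put the differences above into perspective is to measure them relative to the magnitude of the entries rather than absolutely. The following is a minimal CPU-only NumPy sketch (not part of the original question) that compares a float32 product against a float64 reference for the same kind of input:

import numpy as np

np.random.seed(0)
n = 4
a = (np.random.randn(n, n) * 100).astype(np.float32)
b = (np.random.randn(n, n) * 100).astype(np.float32)

c32 = np.dot(a, b)                                         # float32 product
c64 = np.dot(a.astype(np.float64), b.astype(np.float64))   # float64 reference

abs_err = np.abs(c32 - c64).max()
rel_err = abs_err / np.abs(c64).max()                      # normwise relative error
print("max absolute difference: %g" % abs_err)   # roughly 1e-3 for entries of order 1e4
print("relative difference: %g" % rel_err)       # roughly 1e-7, near float32 epsilon

A normwise relative difference near float32 machine epsilon (about 1.2e-7) is what single precision arithmetic can be expected to deliver, whichever machine performs the multiplication.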

Here is the source code:

matmul.py

import numpy as np
import time
# import pycuda stuff
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

BLOCK_SIZE = 16

n = 4
ni = np.int32(n)

# matrix A 
a = np.random.randn(n, n)*100
a = a.astype(np.float32)

# matrix B
b = np.random.randn(n, n)*100
b = b.astype(np.float32)

# matrix C (result)
c = np.empty([n, n])
c = c.astype(np.float32)

# allocate memory on device
a_gpu = cuda.mem_alloc(a.nbytes)
b_gpu = cuda.mem_alloc(b.nbytes)
c_gpu = cuda.mem_alloc(c.nbytes)

# copy matrix to memory
cuda.memcpy_htod(a_gpu, a)
cuda.memcpy_htod(b_gpu, b)

# compile kernel
mod = SourceModule(open("kernels.cu", "r").read())

# get function
matmul = mod.get_function("matmul");


# set grid size
if n%BLOCK_SIZE != 0:
    grid=(n/BLOCK_SIZE+1,n/BLOCK_SIZE+1,1)
else:
    grid=(n/BLOCK_SIZE,n/BLOCK_SIZE,1)

# call gpu function
start = time.time()
matmul(ni, a_gpu, b_gpu, c_gpu, block=(BLOCK_SIZE,BLOCK_SIZE,1), grid=grid);
end = time.time()
print "Time: %.5f s"%(end-start)

# copy back the result
cuda.memcpy_dtoh(c, c_gpu)

print np.linalg.norm(c - np.dot(a,b))
print c
print np.dot(a,b)
print c - np.dot(a,b)

kernels.cu

__global__ void matmul(int n, const float *A, const float *B, float *C){

  int tx = threadIdx.x;
  int ty = threadIdx.y;

  int bx = blockIdx.x;
  int by = blockIdx.y;

  int row = by*blockDim.y + ty;
  int col = bx*blockDim.x + tx;

  if(row < n && col < n){
    float val = 0.0;
    for(int i=0; i<n; ++i){
      val += A[row*n + i]*B[n*i + col];
    }
    C[row*n + col] = val;
  }
}


Recommended answer

Just adding to what Warren has said. I don't think there's anything wrong here. You're comparing the floating point results generated by two different machines (CPU and GPU). They are not guaranteed to be bitwise identical for the operations at the level you are thinking about, partly because the order of operations on the GPU is not necessarily the same as the order of operations on the CPU. As you increase the size of the matrices, you are increasing the number of values summed together, and the absolute error increases, since you are adding a bunch of small rounding errors together.
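To see the size dependence without any GPU involved, one can compare a float32 product against a float64 reference for growing n (a small illustrative sketch, not from the original answer):

import numpy as np

np.random.seed(0)
for n in (4, 64, 512, 1024):
    a = (np.random.randn(n, n) * 100).astype(np.float32)
    b = (np.random.randn(n, n) * 100).astype(np.float32)
    ref = np.dot(a.astype(np.float64), b.astype(np.float64))
    err = np.abs(np.dot(a, b) - ref).max()
    # Each output element sums n rounded float32 products, so the absolute
    # error tends to grow with n while the relative error stays near float32 eps.
    print("%d  %g" % (n, err))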

In general, these kinds of considerations should always come into play when comparing floating point results. Bitwise identical results are rarely to be expected from two different computations. Even something as simple as changing the order of operations can make it a different calculation, as far as floating point arithmetic is concerned. You may want to read this paper, especially section 2.2.
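As a small illustration of the order-of-operations point (a sketch assuming nothing beyond NumPy), summing the very same float32 values in different orders can already change the last bits of the result:

import numpy as np

np.random.seed(1)
x = (np.random.randn(10000) * 100).astype(np.float32)

s_forward = np.float32(0)
for v in x:                  # left-to-right accumulation
    s_forward += v

s_reverse = np.float32(0)
for v in x[::-1]:            # same values, reversed order
    s_reverse += v

s_pairwise = x.sum()         # NumPy's pairwise summation

print("%.4f %.4f %.4f" % (s_forward, s_reverse, s_pairwise))   # typically three slightly different values

None of these results is "wrong"; they are simply different roundings of the same mathematical sum, which is exactly what happens when the CPU and the GPU accumulate the dot products in different orders.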

