Minimum matrix sizes to benefit from matrix multiplication on GPU


Question

I am particularly interested in matrix multiplication using Metal Performance Shaders, but answers about other frameworks are also fine.

Matrix multiplication is a theoretically highly parallelisable operation. I need to multiply many matrices by their own transposes, i.e. compute A'A (where the apostrophe stands for transposition). The matrices A are about 4000 x 300. Given these sizes, I was wondering whether it is worth porting the multiplication code to the GPU. As I understand it, multiplying on the GPU also involves copying the data from main memory to GPU memory (I'm using an eGPU, so the memory is not shared). There must therefore be a trade-off between the extra effort of copying the data back and forth and the speed-up in the calculation. So my question is: at roughly what matrix sizes could I start to see the benefit of doing it on the GPU?
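To put rough numbers on that trade-off (my own back-of-envelope estimate, not part of the original question): computing A'A costs about 2·K²·N floating-point operations, while only about N·K + K² floats cross the bus, so each transferred byte is reused many times. A sketch in Swift:

// Back-of-envelope arithmetic-intensity estimate for C = A'A with A of
// size N x K in float32 (dimensions taken from the question).
let N = 4000
let K = 300

let flops = 2.0 * Double(K) * Double(K) * Double(N)   // multiply-add count
let bytesMoved = Double((N * K + K * K) * 4)          // send A down, read C back

print("FLOPs per product: \(flops / 1e9) GFLOP")      // ~0.72 GFLOP
print("Bus traffic: \(bytesMoved / 1e6) MB")          // ~5.2 MB
print("FLOPs per byte moved: \(flops / bytesMoved)")  // ~140

With roughly 140 FLOPs per transferred byte, the transfer cost should amortize quickly if the GPU's matrix kernels are fast, which is consistent with the accepted answer's measurements below.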

P.S. There is also this article, which basically says not to bother because the GPU doesn't help, citing slow GPU memory caches (in general, across GPUs): https://graphics.stanford.edu/papers/gpumatrixmult/gpumatrixmult.pdf

Answer

I ran a test, and in my case it is significantly faster (8-9x) on the GPU, even including all the memory copying from CPU to GPU and back. I am comparing float32 matrix multiplication performance, since Metal doesn't support float64.
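The listing below also needs some setup it does not show: the imports, a Metal device, and the timing helper it calls. A minimal sketch of that setup (the helper body is my assumption, written to match the output format printed below):

import Foundation
import Accelerate               // vDSP_mmul
import Metal
import MetalPerformanceShaders  // MPSMatrix, MPSMatrixMultiplication

// Assumed setup: grab the default GPU (with an eGPU you may instead want
// to pick a specific device out of MTLCopyAllDevices()).
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

// Assumed timing helper matching the calls and the output format below.
func printTimeElapsedWhenRunningCode(title: String, operation: () -> Void) {
    let start = CFAbsoluteTimeGetCurrent()
    operation()
    print("Time elapsed for \(title): \(CFAbsoluteTimeGetCurrent() - start) s.")
}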

let count = 100

let N = 7005
let K = 700

// Pad both dimensions up to the next multiple of DIV for the GPU version.
let DIV = 8
let K2 = (K / DIV) * DIV + (K % DIV > 0 ? 1 : 0) * DIV
let N2 = (N / DIV) * DIV + (N % DIV > 0 ? 1 : 0) * DIV

print(N2)
print(K2)

printTimeElapsedWhenRunningCode(title: "vDSP(f)") {
    
    let ATf = [Float].init(repeating: Float(1), count: N*K)
    let Af = [Float].init(repeating: Float(1), count: N*K)
    var C = Array(repeating: Float(0), count: K*K)

    for _ in 0..<count {

        vDSP_mmul(ATf, 1,
                  Af, 1,
                  &C, 1,
                  vDSP_Length(K),
                  vDSP_Length(K),
                  vDSP_Length(N))
    }
}

// Managed buffers keep a CPU-side copy that is synchronized explicitly,
// which suits a discrete/external GPU without shared memory.
guard let bufferA = device.makeBuffer(length: K2 * N2 * MemoryLayout<Float>.stride,
                                      options: [.storageModeManaged]) else {
    fatalError("Could not make buffer A")
}

guard let bufferC = device.makeBuffer(length: K2 * K2 * MemoryLayout<Float>.stride,
                                      options: [.storageModeManaged]) else {
    fatalError("Could not make buffer C")
}

// Row-major layout of the padded matrices: each row is K2 floats wide.
let descA = MPSMatrixDescriptor(dimensions: N2,
                                columns: K2,
                                rowBytes: K2 * MemoryLayout<Float>.stride,
                                dataType: .float32)

let descC = MPSMatrixDescriptor(dimensions: K2,
                                columns: K2,
                                rowBytes: K2 * MemoryLayout<Float>.stride,
                                dataType: .float32)

let matrixA = MPSMatrix(buffer: bufferA, descriptor: descA)
let matrixC = MPSMatrix(buffer: bufferC, descriptor: descC)

// With transposeLeft = true this computes C = alpha * A' * B + beta * C,
// i.e. A'A when the same matrix is passed as both operands.
let matrixMultiplication = MPSMatrixMultiplication(device: device,
                                                   transposeLeft: true,
                                                   transposeRight: false,
                                                   resultRows: K2,
                                                   resultColumns: K2,
                                                   interiorColumns: N2,
                                                   alpha: 1,
                                                   beta: 0)

guard let commandQueue = device.makeCommandQueue() else {
    fatalError("Could not make command queue")
}

printTimeElapsedWhenRunningCode(title: "Metal") {
    
    let Af = [Float].init(repeating: Float(1), count: N*K)
    let zeros = [Float].init(repeating: Float(0), count: K2)
    let floatStride = MemoryLayout<Float>.stride

    for i in 0..<count {

        // Copy A into the padded buffer row by row, zero-filling the extra
        // columns. copyMemory counts bytes, so element counts are scaled by
        // the float stride.
        var dest = bufferA.contents()
        Af.withUnsafeBufferPointer { pA in
            var from = pA.baseAddress!
            for _ in 0..<N {
                dest.copyMemory(from: from, byteCount: K * floatStride)
                dest += K * floatStride
                if K2 > K {
                    dest.copyMemory(from: zeros, byteCount: (K2 - K) * floatStride)
                    dest += (K2 - K) * floatStride
                }
                from += K
            }
        }

        // Zero-fill the padding rows at the bottom.
        for _ in 0..<(N2-N) {
            dest.copyMemory(from: zeros, byteCount: K2 * floatStride)
            dest += K2 * floatStride
        }

        // Mark the modified byte range so Metal uploads it to the GPU.
        bufferA.didModifyRange(0..<N2*K2*floatStride)
        
        let commandBuffer = commandQueue.makeCommandBuffer()!

        // C = A' * A; the same matrix is passed as both operands.
        matrixMultiplication.encode(commandBuffer: commandBuffer,
                                    leftMatrix: matrixA,
                                    rightMatrix: matrixA,
                                    resultMatrix: matrixC)

        // Synchronize the managed result buffer back to CPU-visible memory.
        let blitEncoder = commandBuffer.makeBlitCommandEncoder()!
        blitEncoder.synchronize(resource: bufferC)
        blitEncoder.endEncoding()

        commandBuffer.commit()

        // Block only on the last command buffer; earlier iterations pipeline.
        if i == count - 1 {
            commandBuffer.waitUntilCompleted()
        }
    }
}

Output:

AMD Radeon RX 5700 XT
7008
704
Time elapsed for vDSP(f): 5.156805992126465 s.
Time elapsed for Metal: 0.6834449768066406 s.
DONE.
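As a sanity check (my addition, not part of the original answer), once the last command buffer has completed and the blit has synchronized bufferC, the result can be verified on the CPU. With all-ones input, every entry of A'A is an inner product of N ones, so it should equal N exactly:

// Sanity check: with all-ones input, every entry of the K x K result
// block should equal N (the inner dimension), here 7005.
let cPtr = bufferC.contents().bindMemory(to: Float.self, capacity: K2 * K2)
for row in 0..<K {
    for col in 0..<K {
        // Rows of the padded result are K2 floats wide.
        assert(cPtr[row * K2 + col] == Float(N), "mismatch at (\(row), \(col))")
    }
}
print("GPU result verified: all \(K * K) entries equal \(N)")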
