How to Speed Up Metal Code for iOS/Mac OS


Problem Description


I'm trying to implement code in Metal that performs a 1D convolution between two vectors. I've implemented the following, which works correctly:

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const device int& dataSize [[ buffer(1) ]],
                     const device float *filterVector [[ buffer(2) ]],
                     const device int& filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]]) {
    int outputSize = dataSize - filterSize + 1;
    for (int i=0;i<outputSize;i++) {
        float sum = 0.0;
        for (int j=0;j<filterSize;j++) {
            sum += dataVector[i+j] * filterVector[j];
        }
        outVector[i] = sum;
    }
}
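For reference, the kernel above computes a "valid" cross-correlation: each output element is a dot product of the filter with a window of the data, and the output has `dataSize - filterSize + 1` elements. A minimal Python sketch of the same arithmetic (not the author's code, just a plain restatement of the two loops):

```python
def convolve_valid(data, filt):
    """Serial reference for the kernel above:
    out[i] = sum_j data[i + j] * filt[j], for i in [0, len(data) - len(filt)]."""
    output_size = len(data) - len(filt) + 1
    out = []
    for i in range(output_size):
        s = 0.0
        for j in range(len(filt)):
            s += data[i + j] * filt[j]
        out.append(s)
    return out

print(convolve_valid([1.0, 2.0, 3.0, 4.0], [1.0, 1.0]))  # -> [3.0, 5.0, 7.0]
```

Note that, as written, the Metal kernel runs this entire double loop in every GPU thread, which is why the serial version performs so poorly.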


My problem is it takes about 10 times longer to process (computation + data transfer to/from GPU) the same data using Metal than in Swift on a CPU. My question is how do I replace the inner loop with a single vector operation or is there another way to speed up the above code?

Answer


The key to taking advantage of the GPU's parallelism in this case is to let it manage the outer loop for you. Instead of invoking the kernel once for the entire data vector, we'll invoke it for each element in the data vector. The kernel function simplifies to this:

kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     const constant int &dataSize [[ buffer(1) ]],
                     const constant float *filterVector [[ buffer(2) ]],
                     const constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}
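To see that this one-thread-per-element version computes the same result as the serial kernel, here is a small Python simulation (an illustration only): each value of `tid` plays the role of one GPU thread's `thread_position_in_grid`, and each thread writes exactly one output element.

```python
def kernel_thread(data, filt, tid):
    """Simulates one GPU thread of the parallel kernel: computes outVector[tid]."""
    s = 0.0
    for i in range(len(filt)):
        s += data[tid + i] * filt[i]
    return s

data = [1.0, 2.0, 3.0, 4.0, 5.0]
filt = [0.5, 0.5]
# One "thread" per output element, as the GPU dispatch does in parallel.
out = [kernel_thread(data, filt, tid) for tid in range(len(data) - len(filt) + 1)]
print(out)  # -> [1.5, 2.5, 3.5, 4.5]
```

On the GPU these iterations run concurrently across threads rather than in a Python loop, which is where the speedup comes from.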


In order to dispatch this work, we select a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is making sure that there's enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cause us to waste a small amount of memory and computation, but saves us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.

// We should ensure here that the data buffer and output buffer each have a size that is a multiple of
// the compute pipeline's threadExecutionWidth, by padding the amount we allocate for each of them.
// After execution, we ignore the extraneous elements in the output buffer beyond the first (dataCount - filterCount + 1).

let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)

let commandEncoder = commandBuffer.makeComputeCommandEncoder()!
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, index: 0)
commandEncoder.setBytes(&dataCount, length: MemoryLayout<Int>.stride, index: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, index: 2)
commandEncoder.setBytes(&filterCount, length: MemoryLayout<Int>.stride, index: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, index: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
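The sizing arithmetic in the Swift snippet above can be sketched in Python. The names mirror the Swift variables (`threadExecutionWidth` etc.); the width of 32 in the example call is an assumed value for illustration, since the real number comes from the compute pipeline state at runtime.

```python
def dispatch_sizes(data_count, filter_count, thread_execution_width):
    """Mirrors the threadgroup sizing in the Swift dispatch code."""
    iteration_count = data_count - filter_count + 1
    threads_per_threadgroup = min(iteration_count, thread_execution_width)
    # Ceiling division: enough threadgroups to cover every output element.
    threadgroups = (iteration_count + threads_per_threadgroup - 1) // threads_per_threadgroup
    # Buffers should be allocated at least this long, so the threads in the
    # final threadgroup can safely overrun the real data size.
    padded_length = threadgroups * threads_per_threadgroup
    return iteration_count, threads_per_threadgroup, threadgroups, padded_length

print(dispatch_sizes(1000, 17, 32))  # -> (984, 32, 31, 992)
```

The small overrun (992 vs. 984 outputs here) is the wasted memory and computation the answer mentions; the extra elements are simply ignored after execution.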


In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'm curious to hear how it compares to your CPU implementation.

