Is it possible to use SIMD on a serial dependency in a calculation, like an exponential moving average filter?


Question

I'm processing multiple (independent) Exponential Moving Average 1-Pole filters (http://musicdsp.org/archive.php?classid=3#257) on different parameters I have within my Audio application, with the intent of smoothing each param value at audio rate:

for (int i = 0; i < mParams.GetSize(); i++) {
    mParams.Get(i)->SmoothBlock(blockSize);
}

...

inline void SmoothBlock(int blockSize) {
    double inputA0 = mValue * a0;

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
        mSmoothedValues[sampleIndex] = z1 = inputA0 + z1 * b1;
    }
}   

I'd like to take advantage of CPU SIMD instructions, processing them in parallel, but I'm not really sure how I can achieve this.

In fact, z1 is recursive: I can't "pack" an array of doubles, since each value depends on the previous one, right?

Maybe there is a way to properly organize the data of the different filters and process them in parallel?

Any tips or suggestions would be welcome!

Please note: I don't have several signal paths. The params represent different controls for the (single) processed signal. Say I have a sine signal: param 1 will affect gain, param 2 pitch, param 3 filter cutoff, param 4 pan, and so on.

Answer

If there's a closed-form formula for n steps ahead, you can use that to sidestep serial dependencies. If the n-step update can be computed with the same operations as a single step, just with different coefficients, a broadcast is all you need.

Like in this case, z1 = c + z1 * b, so applying that twice we get

# I'm using Z0..n as the elements in the series your loop calculates
Z2 = c + (c+Z0*b)*b
   = c + c*b + Z0*b^2

c + c*b and b^2 are both constants, assuming I'm understanding your code correctly: all the c-like variables really are scalar variables, not pseudocode for array references. (So everything except z1 is loop-invariant.)
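As a quick sanity check (a sketch, not part of the original answer): applying the one-step recurrence twice agrees with the closed-form two-step update for any starting value.

```cpp
// One application of the recurrence: z -> c + z*b
inline double step1(double z, double b, double c) { return c + z * b; }

// Closed-form two-step update derived above: Z2 = (c + c*b) + Z0 * b^2
inline double step2(double z, double b, double c) { return (c + c * b) + z * (b * b); }
```

For any z, b, c, `step1(step1(z, b, c), b, c)` equals `step2(z, b, c)` (up to floating-point rounding).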

So if we have a SIMD vector of 2 elements, starting with Z0 and Z1, we can step each of them forward by 2 to get Z2 and Z3.

#include <immintrin.h>  // SSE2 intrinsics; mSmoothedValues is assumed to be a double* member

void SmoothBlock(int blockSize, double b, double c, double z_init) {

    // z1 = inputA0 + z1 * b1;   (original scalar recurrence)
    __m128d zv = _mm_setr_pd(z_init, z_init*b + c);

    __m128d step2_mul = _mm_set1_pd(b*b);
    __m128d step2_add = _mm_set1_pd(c + c*b);

    for (int i = 0; i < blockSize-1; i+=2) {
        _mm_storeu_pd(mSmoothedValues + i, zv);
        zv = _mm_mul_pd(zv, step2_mul);
        zv = _mm_add_pd(zv, step2_add);
       // compile with FMA + fast-math for the compiler to fold the mul/add into one FMA
    }
    // handle final odd element if necessary
    if(blockSize%2 != 0)
        _mm_store_sd(mSmoothedValues+blockSize-1, zv);
}
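To verify the vectorized loop, a plain scalar reference producing the same series is handy. This is a sketch (the name `smooth_reference` and the `std::vector` output are mine, not from the answer); note it stores the current state first and then steps, matching the store-then-step order of the SIMD loop above.

```cpp
#include <vector>

// Scalar reference for the SIMD SmoothBlock above: store the current
// state, then apply one step of z = c + z*b.
std::vector<double> smooth_reference(int blockSize, double b, double c, double z_init) {
    std::vector<double> out(blockSize);
    double z = z_init;
    for (int i = 0; i < blockSize; i++) {
        out[i] = z;        // store Z_i
        z = c + z * b;     // advance to Z_{i+1}
    }
    return out;
}
```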

With float + AVX (8 elements per vector), you'd have

__m256 zv = _mm256_setr_ps(z_init, c + z_init*b,
                           c + c*b + z_init*b*b,
                           c + c*b + c*b*b + z_init*b*b*b, ...);

// Z2 = c + c*b + Z0*b^2
// Z3 = c + c*b + (c + Z0*b) * b^2
// Z3 = c + c*b + c*b^2 + Z0*b^3

and the add/mul factors would be for 8 steps.
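One generic way to build those n-step factors (a hypothetical helper, not from the answer): fold n input terms into the add factor while accumulating b^n as the mul factor.

```cpp
#include <utility>

// n-step coefficients for the recurrence z = c + z*b:
//   mul = b^n
//   add = c * (1 + b + b^2 + ... + b^(n-1))
// so that n applications of the recurrence equal z -> add + z*mul.
std::pair<double, double> nstep_coeffs(double b, double c, int n) {
    double mul = 1.0, add = 0.0;
    for (int i = 0; i < n; i++) {
        add = c + add * b;   // fold in one more step's input term
        mul *= b;            // one more factor of b
    }
    return {mul, add};
}
```

For n = 2 this reproduces the b^2 and c + c*b constants derived above; for the AVX float version you'd use n = 8.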

Normally people use float for SIMD because you get twice as many elements per vector and half the memory bandwidth / cache footprint, so you typically get a factor-of-2 speedup over double. (Same number of vectors / bytes per clock.)

The above loop on a Haswell or Sandybridge CPU, for example, will run at one vector per 8 cycles, bottlenecked on the latency of mulpd (5 cycles) + addpd (3 cycles). We generate 2 double results per vector, but that's still a huge bottleneck compared to the throughput of 1 mul and 1 add per clock. We're missing out on a factor of 8 in throughput.

(Or if compiled with one FMA instead of mul->add, then we have 5 cycle latency.)

Sidestepping the serial dependency is useful for more than just SIMD: avoiding the bottleneck on FP add/mul (or FMA) latency gives a further speedup, up to the ratio of FP add/mul latency to add + mul throughput.

Simply unroll more, and use multiple vectors, like zv0, zv1, zv2, zv3. This also increases the number of steps you make at once. So, for example, 16-byte vectors of float, with 4 vectors, would be 4x4 = 16 steps.
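A scalar sketch of that unrolling idea (the function name and layout are mine; a real implementation would use zv0..zv3 SIMD vectors instead of scalars): run four independent chains, each jumping 4 steps at a time with the 4-step coefficients.

```cpp
#include <vector>

// Four independent dependency chains, each advancing 4 steps at a time.
// Chain k holds Z_k, Z_{k+4}, Z_{k+8}, ...; with SIMD each chain would be
// one lane/vector, hiding the FP latency of a single chain.
std::vector<double> smooth_unrolled4(int blockSize, double b, double c, double z_init) {
    double mul = b * b * b * b;                        // 4-step mul factor: b^4
    double add = c * (1.0 + b + b * b + b * b * b);    // 4-step add factor

    double z[4];                       // seed chains with Z0..Z3 (one-step recurrence)
    z[0] = z_init;
    for (int k = 1; k < 4; k++) z[k] = c + z[k - 1] * b;

    std::vector<double> out(blockSize);
    int i = 0;
    for (; i + 4 <= blockSize; i += 4) {
        for (int k = 0; k < 4; k++) {
            out[i + k] = z[k];         // store the current block of 4 states
            z[k] = add + z[k] * mul;   // jump each chain forward by 4 steps
        }
    }
    for (int k = 0; i < blockSize; i++, k++)   // tail: remaining 1-3 elements
        out[i] = z[k];
    return out;
}
```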

