Is it possible to use SIMD on a serial dependency in a calculation, like an exponential moving average filter?
Question
I'm processing multiple (independent) Exponential Moving Average 1-Pole filters (http://musicdsp.org/archive.php?classid=3#257) on different parameters I have within my audio application, with the intent of smoothing each param value at audio rate:
for (int i = 0; i < mParams.GetSize(); i++) {
    mParams.Get(i)->SmoothBlock(blockSize);
}

...

inline void SmoothBlock(int blockSize) {
    double inputA0 = mValue * a0;
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
        mSmoothedValues[sampleIndex] = z1 = inputA0 + z1 * b1;
    }
}
I'd like to take advantage of CPU SIMD instructions, processing them in parallel, but I'm not really sure how I can achieve this.
In fact, z1 is recursive: I can't "pack" an array of doubles considering "previous values", right?
Maybe there is a way to properly organize the data of the different filters and process them in parallel?
Any tips or suggestions would be welcome!
Please note: I don't have several signal paths. The params represent different controls for the (single) processed signal. Say I have a sine signal: param 1 will affect gain, param 2 pitch, param 3 filter cutoff, param 4 pan, and so on.
Solution
If there's a closed-form formula for n steps ahead, you can use that to sidestep the serial dependency. If it can be computed with the same operations as for 1 step, just with different coefficients, a broadcast is all you need.
Like in this case, z1 = c + z1 * b, so applying that twice we get
# I'm using Z0..n as the elements in the series your loop calculates
Z2 = c + (c+Z0*b)*b
= c + c*b + Z0*b^2
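As a quick sanity check (not part of the original answer; the function name is made up for illustration), the 2-step closed form can be verified against two ordinary 1-step updates:

```cpp
#include <cassert>
#include <cmath>

// Returns true iff the 2-step closed form matches two 1-step updates
// of z -> c + z*b, for the given coefficients and starting value z0.
bool TwoStepMatches(double b, double c, double z0) {
    double z1 = c + z0 * b;                          // one step
    double z2 = c + z1 * b;                          // two steps
    double z2_closed = (c + c * b) + z0 * (b * b);   // Z2 = c + c*b + Z0*b^2
    return std::fabs(z2 - z2_closed) < 1e-12;
}
```

The same algebra generalizes: every extra step multiplies the coefficient on Z0 by b and folds another c*b^k into the additive constant.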
c + c*b
和b^2
都是常量,如果我正确理解您的代码,则所有C变量实际上只是C变量,而不是数组引用的伪代码. (因此,除了z1
之外的所有内容都是循环不变的.)
c + c*b
and b^2
are both constants, if I'm understanding your code correctly that all the C variables are really just C variables, not pseudocode for an array reference. (So everything except your z1
are loop invariant).
So if we have a SIMD vector of 2 elements, starting with Z0 and Z1, we can step each of them forward by 2 to get Z2 and Z3.
#include <immintrin.h>

void SmoothBlock(int blockSize, double b, double c, double z_init) {
    // z1 = inputA0 + z1 * b1;
    __m128d zv = _mm_setr_pd(z_init, z_init*b + c);
    __m128d step2_mul = _mm_set1_pd(b*b);
    __m128d step2_add = _mm_set1_pd(c + c*b);
    for (int i = 0; i < blockSize-1; i += 2) {
        _mm_storeu_pd(mSmoothedValues + i, zv);
        zv = _mm_mul_pd(zv, step2_mul);
        zv = _mm_add_pd(zv, step2_add);
        // compile with FMA + fast-math for the compiler to fold the mul/add into one FMA
    }
    // handle final odd element if necessary
    if (blockSize % 2 != 0)
        _mm_store_sd(mSmoothedValues + blockSize - 1, zv);
}
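For reference, the same two-interleaved-series idea can be modeled without intrinsics; a portable scalar sketch (function name hypothetical, storing to a vector instead of the member array) that produces exactly the values the SSE2 version stores:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar model of the SSE2 loop: two interleaved series, each advanced
// by the 2-step coefficients b^2 and c + c*b.
std::vector<double> SmoothBlock2Step(int blockSize, double b, double c, double z_init) {
    std::vector<double> out(blockSize);
    double z_even = z_init;          // lane 0: Z0, Z2, Z4, ...
    double z_odd  = c + z_init * b;  // lane 1: Z1, Z3, Z5, ...
    double mul2 = b * b;
    double add2 = c + c * b;
    int i = 0;
    for (; i + 1 < blockSize; i += 2) {
        out[i]     = z_even;
        out[i + 1] = z_odd;
        z_even = z_even * mul2 + add2;
        z_odd  = z_odd  * mul2 + add2;
    }
    if (blockSize % 2 != 0)
        out[blockSize - 1] = z_even;  // final odd element, like _mm_store_sd
    return out;
}
```

Comparing its output against the plain one-step loop (out[i] = z; z = c + z*b;) is an easy way to convince yourself the coefficient algebra is right before touching intrinsics.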
With float + AVX (8 elements per vector), you'd have
__m256 zv = _mm256_setr_ps(z_init, c + z_init*b,
c + c*b + z_init*b*b,
c + c*b + c*b*b + z_init*b*b*b, ...);
// Z2 = c + c*b + Z0*b^2
// Z3 = c + c*b + (c + Z0*b) * b^2
// Z3 = c + c*b + c*b^2 + Z0*b^3
and the add/mul factors would be for 8 steps.
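In general, the n-step coefficients follow from the geometric series: stepping z -> c + z*b a total of n times gives z*b^n + c*(1 + b + ... + b^(n-1)) = z*b^n + c*(1 - b^n)/(1 - b) for b != 1. A hypothetical helper (not from the original answer) to compute them:

```cpp
#include <cassert>
#include <cmath>

// Coefficients for advancing z -> c + z*b by n steps at once:
// z_n = z * mul_n + add_n.  Assumes b != 1 (always true for a
// convergent smoothing filter, where |b| < 1).
void NStepCoeffs(double b, double c, int n, double &mul_n, double &add_n) {
    mul_n = std::pow(b, n);
    add_n = c * (1.0 - mul_n) / (1.0 - b);  // c * (1 + b + ... + b^(n-1))
}
```

For n = 2 this reduces to the b^2 and c + c*b constants used above; for the AVX float version you'd call it with n = 8.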
Normally people use float for SIMD because you get twice as many elements per vector and half the memory bandwidth / cache footprint, so you typically get a factor of 2 speedup over double. (Same number of vectors / bytes per clock.)
The above loop on a Haswell or Sandybridge CPU, for example, will run at one vector per 8 cycles, bottlenecked on the latency of mulpd (5 cycles) + addpd (3 cycles). We generate 2 double results per vector, but that's still a huge bottleneck compared to the 1 mul and 1 add per clock throughput. We're missing out on a factor of 8 of throughput.
(Or if compiled with one FMA instead of mul->add, then we have 5 cycle latency.)
Dodging the serial dependency is useful for more than just SIMD: avoiding a bottleneck on FP add/mul (or FMA) latency gives a further speedup, up to the ratio of FP add/mul latency to add + mul throughput.
Simply unroll more, and use multiple vectors, like zv0, zv1, zv2, zv3. This increases the number of steps you make at once, too. So for example, 16-byte vectors of float, with 4 vectors, would be 4x4 = 16 steps.
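A scalar sketch of that unrolling idea (name and structure hypothetical): 4 independent accumulators, each advanced by the 4-step coefficients, so the 4 multiply-add chains have no dependency on each other and can overlap in the pipeline. The same pattern applies with each accumulator being a SIMD vector instead of a scalar:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// 4 interleaved series, each stepped by 4: hides FP latency behind
// independent dependency chains (extend each scalar to a vector for SIMD).
std::vector<double> SmoothBlockUnroll4(int n, double b, double c, double z) {
    std::vector<double> out(n);
    double mul4 = b * b * b * b;                 // b^4
    double add4 = c * (1 + b + b * b + b * b * b);
    // Seed the 4 chains with Z0..Z3.
    double z0 = z;
    double z1 = c + z0 * b;
    double z2 = c + z1 * b;
    double z3 = c + z2 * b;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        out[i] = z0; out[i + 1] = z1; out[i + 2] = z2; out[i + 3] = z3;
        z0 = z0 * mul4 + add4;   // 4 independent chains: the CPU can
        z1 = z1 * mul4 + add4;   // execute these in parallel instead of
        z2 = z2 * mul4 + add4;   // waiting out one long latency chain
        z3 = z3 * mul4 + add4;
    }
    for (; i < n; ++i) {         // scalar tail for the last 0..3 elements
        out[i] = z0;
        z0 = c + z0 * b;
    }
    return out;
}
```

With 16-byte float vectors in place of the scalars, the same layout gives the 4x4 = 16 steps mentioned above.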