从二阶导数计算的曲线的 SIMD 优化 [英] SIMD optimization of a curve computed from the second derivative

查看:82
本文介绍了从二阶导数计算的曲线的 SIMD 优化的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题真是好奇.

我正在将一个例程转换为 SIMD 指令(我对 SIMD 编程还很陌生),并且在使用以下代码时遇到了问题:

I was converting a routine into SIMD instructions (and I am quite new to SIMD programming), and had trouble with the following bit of code:

// args:
uint32_t phase_current;
uint32_t phase_increment;
uint32_t phase_increment_step;

for (int i = 0; i < blockSize; ++i)
{
    USEFUL_FUNC(phase_current);
    phase_increment += phase_increment_step;
    phase_current += phase_increment;
}

问题:假设 USEFUL_FUNC 有一个 SIMD 实现,我只是想计算一个正确的 phase_current 向量进行处理,正确的方法是什么处理依赖于其先前值的 phase_current 的问题?

The Question: Assuming that USEFUL_FUNC has a SIMD implementation and I am simply trying to compute a correct vector of phase_current for processing, what is the right way of dealing with the phase_current being dependent on its previous value?

反过来,类似函数式编程 fold 的实现也同样有用,因为我试图了解如何提升数据依赖,而我正试图优化优化.

In turn, a functional programming fold-like implementation would be similarly useful, since I'm trying to understand how to hoist out a data dependency more that I am trying to optimize for the sake of optimizing.

最后,如果你能推荐一些文献,请推荐.不知道如何在 Google 上搜索此主题.

Lastly, if you can recommend some literature, please do. Not sure how to Google for this topic.

推荐答案

所以您只是在寻找一种方法来生成 4 个 phase_current 值的向量,您可以将其作为 arg 传递给任意函数.

So you're just looking for a way to generate vectors of 4 phase_current values, which you can pass as an arg to an arbitrary function.

TL:DR:设置增量和步长的初始向量,使每个向量元素跨序列 4,为您提供 phase_current[i+0..i+ 的向量3] 仍然只有两个向量 ADD 操作(垂直,不是水平).这种串行依赖是您可以通过代数/数学分解出来的.

TL:DR: set up the initial vectors for increment and step so each vector element strides through the sequence by 4, giving you vectors of phase_current[i+0..i+3] with still only two vector ADD operations (vertical, not horizontal). This serial dependency is one you can factor out with algebra / math.

这有点像前缀总和(您可以使用 log2(vector_width) 对带有 vector_width 元素 的向量进行混洗+加法运算.)您还可以使用两步计算并行化多个线程的前缀和,其中每个线程前缀- 对数组的一个区域求和,然后组合结果并使每个线程将其目标数组的区域偏移一个常数(该区域的第一个元素的总和.也请参阅多线程的链接问题.

This is a bit like a prefix-sum (which you can SIMD with log2(vector_width) shuffle+add operations for vectors with vector_width elements.) You can also parallelize prefix sums with multiple threads using a two-step calculation where each thread prefix-sums a region of an array, then combine the results and have each thread offset its region of the destination array by a constant (the sum total for the first element of that region. See the linked question for multi-threading, too.

但是你有一个巨大的简化,phase_increment_step(你想要的值的二阶导数)是常数.我假设 USEFUL_FUNC(phase_current); 通过值而不是非常量引用来获取其 arg,因此对 phase_current 的唯一修改是通过 += 在循环中.而且 useful_func 不能以某种方式改变 increment 或 increment_step.

But you have the huge simplification that phase_increment_step (the 2nd derivative of the value you want) is constant. I'm assuming that USEFUL_FUNC(phase_current); takes its arg by value, not by non-const reference, so the only modification to phase_current is by the += in the loop. And that useful_func can't somehow mutate the increment or increment_step.

实现这一点的一种选择是在 SIMD 向量的 4 个独立元素中独立运行标量算法,每次偏移 1 次迭代.使用整数加法,尤其是在向量整数加法延迟仅为 1 个周期的 Intel CPU 上,运行运行总计的 4 次迭代很便宜,我们可以在调用 USEFUL_FUNC 之间这样做.这将是一种为 USEFUL_FUNC 生成与标量代码完全一样多的工作的向量输入的方法(假设 SIMD 整数加法与标量整数加法一样便宜,如果我们受数据依赖性限制为每个时钟 2 次加法,则这通常是正确的).

One option to implement this is just to run the scalar algorithm independently in 4 separate elements of SIMD vectors, offset by 1 iteration each time. With integer adds, especially on Intel CPUs where vector-integer add latency is only 1 cycle, running 4 iterations of the running-total is cheap, and we could just do that between calls to USEFUL_FUNC. That would be a way to generate vector inputs to USEFUL_FUNC doing exactly as much work as scalar code (assuming SIMD integer add is as cheap as scalar integer add, which is mostly true if we're limited by the data dependency to 2 adds per clock).

上述方法在某种程度上更通用,对于这个问题的变体可能有用,因为我们无法通过简单的数学运算廉价地消除真正的串行依赖.

The above method is somewhat more general and could be useful for variations on this problem where there's a true serial dependency that we can't eliminate cheaply with simple math.

如果我们很聪明,我们可以做得比前缀和或一次一个步骤运行 4 个序列的蛮力更好.理想情况下,我们可以推导出一种封闭形式的方法,以在值序列中步长 4(或无论 SIMD 向量宽度是多少,乘以 USEFUL_FUNC).

If we're clever, we can do even better than a prefix sum or brute-force running 4 sequences one step at a time. Ideally we can derive a closed-form way to step by 4 in the sequence of values (or whatever the SIMD vector width is, times whatever unroll factor you want for multiple accumulators for USEFUL_FUNC).

stepstep*2step*3、... 的序列求和将给我们一个常数次 高斯的闭式公式,用于整数之和<代码>n:sum(1..n) = n*(n+1)/2.这个序列是 0, 1, 3, 6, 10, 15, 21, 28, ... (https://oeis.org/A000217).(我已经分解出初始的phase_increment).

Summing a sequence of step, step*2, step*3, ... will give us a constant times Gauss's closed-form formula for the sum of integers up to n: sum(1..n) = n*(n+1)/2. This sequence goes 0, 1, 3, 6, 10, 15, 21, 28, ... (https://oeis.org/A000217). (I've factored out the initial phase_increment).

在这个序列中,技巧是 4.(n+4)*(n+5)/2 - n*(n+1)/2 简化为4*n + 10.再次求导,我们得到 4.但是要在第二个积分中走 4 步,我们有 4*4 = 16.所以我们可以维护一个向量phase_increment,我们用一个向量16*phase_increment_step的SIMD加法来增加它.

The trick is going by 4 in this sequence. (n+4)*(n+5)/2 - n*(n+1)/2 simplifies down to 4*n + 10. Taking the derivative of that again, we get 4. But to go 4 steps in the 2nd integral, we have 4*4 = 16. So we can maintain a vector phase_increment which we increment with a SIMD add with a vector of 16*phase_increment_step.

我不完全确定我的计步推理是正确的(4 的额外因子给出 16 有点令人惊讶).计算出正确的公式并在向量序列中进行一阶差分和二阶差分可以非常清楚这是如何实现的:

I'm not totally sure I have the step-counting reasoning right (the extra factor of 4 to give 16 is a bit surprising). Working out the right formulas and taking first and second-differences in the sequence of vectors makes it very clear how this works out:

 // design notes, working through the first couple vectors
 // to prove this works correctly.

S = increment_step (constant)
inc0 = increment initial value
p0 = phase_current initial value

// first 8 step-increases:
[ 0*S,  1*S,   2*S,  3*S ]
[ 4*S,  5*S,   6*S,  7*S ]

// first vector of 4 values:
[ p0,  p0+(inc0+S),  p0+(inc0+S)+(inc0+2*S),  p0+(inc0+S)+(inc0+2*S)+(inc0+3*S) ]
[ p0,  p0+inc0+S,  p0+2*inc0+3*S,  p0+3*inc0+6*S ]  // simplified

// next 4 values:
[ p0+4*inc0+10*S,  p0+5*inc0+15*S,  p0+6*inc0+21*S,  p0+7*inc0+28*S ]

使用这个和之前的 4*n + 10 公式:

Using this and the earlier 4*n + 10 formula:

// first 4 vectors of of phase_current
[ p0,              p0+1*inc0+ 1*S,  p0+2*inc0+3*S,   p0+ 3*inc0+ 6*S ]
[ p0+4*inc0+10*S,  p0+5*inc0+15*S,  p0+6*inc0+21*S,  p0+ 7*inc0+28*S ]
[ p0+8*inc0+36*S,  p0+9*inc0+45*S,  p0+10*inc0+55*S, p0+11*inc0+66*S ]
[ p0+12*inc0+78*S,  p0+13*inc0+91*S,  p0+14*inc0+105*S, p0+15*inc0+120*S ]

 first 3 vectors of phase_increment (subtract consecutive phase_current vectors):
[ 4*inc0+10*S,     4*inc0 + 14*S,   4*inc0 + 18*S,   4*inc0 + 22*S  ]
[ 4*inc0+26*S,     4*inc0 + 30*S,   4*inc0 + 34*S,   4*inc0 + 38*S  ]
[ 4*inc0+42*S,     4*inc0 + 46*S,   4*inc0 + 50*S,   4*inc0 + 54*S  ]

 first 2 vectors of phase_increment_step:
[        16*S,              16*S,            16*S,            16*S  ]
[        16*S,              16*S,            16*S,            16*S  ]
Yes, as expected, a constant vector works for phase_increment_step

因此我们可以使用英特尔的 SSE/AVX 内在函数编写这样的代码:

#include <stdint.h>
#include <immintrin.h>

void USEFUL_FUNC(__m128i);

// TODO: more efficient generation of initial vector values
void double_integral(uint32_t phase_start, uint32_t phase_increment_start, uint32_t phase_increment_step, unsigned blockSize)
{

    __m128i pstep1 = _mm_set1_epi32(phase_increment_step);

    // each vector element steps by 4
    uint32_t inc0=phase_increment_start, S=phase_increment_step;
    __m128i pincr  = _mm_setr_epi32(4*inc0 + 10*S,  4*inc0 + 14*S,   4*inc0 + 18*S,   4*inc0 + 22*S);

    __m128i phase = _mm_setr_epi32(phase_start,  phase_start+1*inc0+ 1*S,  phase_start+2*inc0+3*S,   phase_start + 3*inc0+ 6*S );
     //_mm_set1_epi32(phase_start); and add.
     // shuffle to do a prefix-sum initializer for the first vector?  Or SSE4.1 pmullo by a vector constant?

    __m128i pstep_stride = _mm_slli_epi32(pstep1, 4);  // stride by pstep * 16
    for (unsigned i = 0; i < blockSize; ++i)  {
        USEFUL_FUNC(phase);
        pincr = _mm_add_epi32(pincr, pstep_stride);
        phase = _mm_add_epi32(phase, pincr);
    }
}

进一步阅读:有关 SIMD 的更多信息,但主要是 x86 SSE/AVX,请参阅 https://stackoverflow.com/tags/sse/info,尤其是来自 Insomniac Games (GDC 2015) 上的 SIMD,其中有一些关于如何总体考虑 SIMD 以及如何布置数据以便您可以使用它的好东西.

Further reading: for more about SIMD in general, but mostly x86 SSE/AVX, see https://stackoverflow.com/tags/sse/info, especially slides from SIMD at Insomniac Games (GDC 2015) which have some good stuff about how to think about SIMD in general, and how to lay out your data so you can use it.

这篇关于从二阶导数计算的曲线的 SIMD 优化的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆