在SSE2中使用%? [英] Using % with SSE2?

查看:78
本文介绍了在SSE2中使用%?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是我要转换为SSE2的代码:

Here's the code I'm trying to convert to SSE2:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    // some other code (that will use phase)

    phase += std::clamp(mRadiansPerSample * (bp0 * pB[sampleIndex] + pC[sampleIndex]), 0.0, PI);

    while (phase >= TWOPI) { phase -= TWOPI; }
}

这就是我所取得的成就:

Here's what I've achieved:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;

__m128d v_boundLower = _mm_set1_pd(0.0);
__m128d v_boundUpper = _mm_set1_pd(PI);
__m128d v_bp0 = _mm_set1_pd(bp0);
__m128d v_radiansPerSample = _mm_set1_pd(mRadiansPerSample);

__m128d v_phase = _mm_set1_pd(phase);
__m128d v_pB = _mm_load_pd(pB);
__m128d v_pC = _mm_load_pd(pC);
__m128d v_result = _mm_mul_pd(v_bp0, v_pB);
v_result = _mm_add_pd(v_result, v_pC);
v_result = _mm_mul_pd(v_result, v_radiansPerSample);
v_result = _mm_max_pd(v_result, v_boundLower);
v_result = _mm_min_pd(v_result, v_boundUpper);

for (int sampleIndex = 0; sampleIndex < roundintup8(blockSize); sampleIndex += 8, pB += 8, pC += 8) {
    // some other code (that will use v_phase)

    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 2);
    v_pC = _mm_load_pd(pC + 2);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 4);
    v_pC = _mm_load_pd(pC + 4);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 6);
    v_pC = _mm_load_pd(pC + 6);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 8);
    v_pC = _mm_load_pd(pC + 8);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);

    // ... fmod?
}

但是我不确定如何替换while (phase >= TWOPI) { phase -= TWOPI; }(基本上是C ++中的经典fmod).

But I'm not really sure how to replace while (phase >= TWOPI) { phase -= TWOPI; } (which is basically a classic fmod in C++).

有没有喜欢的内在函数?在此列表上找不到任何内容. 师+某种火箭的移位?

Any fancy intrinsics? Can't find any on this list. Division + some sort of rocket bit-shifting?

推荐答案

正如评论所言,看起来您可以通过带有compare + andpd的蒙版减法来实现它.只要您不超过一个减法即可回到期望的范围.

As comments are saying, it looks like in this you can make it just a masked subtract with a compare + andpd. This works as long as you can never be more than one subtract away from getting back into the desired range.

喜欢

const __m128d v2pi = _mm_set1_pd(TWOPI);


__m128d needs_range_reduction = _mm_cmpge_pd(vphase, v2pi);
__m128d offset = _mm_and_pd(needs_range_reduction, v2pi);  // 0.0 or 2*Pi
vphase = _mm_sub_pd(vphase, offset);


要实现实际的(慢速)fmod,而不必过多担心有效位数的最后几位,则可以先执行integer_quotient = floor(x/y)(或者rint(x/y)ceil),然后执行x - y * integer_quotient.对于SSE4.1 _mm_round_pd_mm_floor_pd()floor/rint/ceil便宜.这样就可以得到余数,就像整数除法一样,它可以是负数.


To implement an actual (slow) fmod without worrying too much about the last few bits of the significand, you'd do integer_quotient = floor(x/y) (or maybe rint(x/y) or ceil), then x - y * integer_quotient. floor / rint / ceil are cheap with SSE4.1 _mm_round_pd or _mm_floor_pd(). That will give you the remainder, which can be negative just like with integer division.

我敢肯定,有一些数值技术可以更好地避免在灾难性抵消之前通过减去两个附近的数字来舍入误差.如果您在乎精度,请检查一下. (当您不太关心精度时使用double向量是很愚蠢的;不妨使用float并为每个向量完成两倍的工作).如果输入比模数大得多,则不可避免地会损失精度,并且最小化临时中的舍入误差可能非常重要.但是,除非x几乎是y的精确倍数,否则除非您关心结果的相对误差非常接近零,否则精度只会成为问题. (接近零的结果,仅剩下有效位数的后几位以保持精度.)

I'm sure there are numerical techniques that better avoid rounding error before the catastrophic cancellation from subtracting two nearby numbers. If you care about precision, go check. (Using double vectors when you don't care much about precision is kind of silly; might as well use float and get twice as much work done per vector). If the input is a lot larger than the modulus, there's an unavoidable loss of precision, and minimizing rounding error in the temporary is probably very important. But otherwise precision will only be a problem unless you care about relative error in results very close to zero when x is almost an exact multiple of y. (Near-zero result, the only the bottom few bits of the significand are left for precision.)

如果没有SSE4.1,则有一些技巧,例如添加然后减去足够大的数字.对于pd,转换为整数然后返回的情况甚至更糟,因为压缩转换指令也解码为一些随机码.更不用说32位整数不能覆盖double的整个范围,但是如果您的输入如此之大,则您会迷失范围缩小的精度.

Without SSE4.1, there are tricks like adding then subtracting a large enough number. Converting to integer and back is even worse for pd, because the packed-conversion instruction decodes to some shuffle uops as well. Not to mention that a 32-bit integer doesn't cover the full range of double, but you're screwed for range-reduction precision if your input was that huge.

如果您拥有 FMA ,则您可以避免在multi和sub的y * integer_quotient部分中舍入错误. _mm_fmsub_pd.

If you have FMA, you can avoid rounding error in the y * integer_quotient part of the multiply and sub. _mm_fmsub_pd.

这篇关于在SSE2中使用%?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆