Efficient SSE FP `floor()` / `ceil()` / `round()` Rounding Functions Without SSE4.1?


Problem description

How can I round a __m128 vector of floats up, down, or to the nearest integer, like these functions do?

I need to do this without the SSE4.1 roundps instruction (_mm_floor_ps / _mm_ceil_ps / _mm_round_ps(x, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC)). roundps can also truncate toward zero, but I don't need that for this application.

I can use SSE3 and earlier (no SSSE3 or SSE4).

So the function declarations would be something like:

__m128 RoundSse(__m128 x);
__m128 CeilSse(__m128 x);
__m128 FloorSse(__m128 x);

Recommended answer

I'm posting this code from http://dss.stephanierct.com/DevBlog/?p=8:

It has been adapted to pass its arguments by value (I just removed the & from the code; I'm not sure whether that is OK):

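#include <emmintrin.h> // SSE2 intrinsics used by all functions below

static inline __m128 FloorSse(const __m128 x) {
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);      // all bits set in every lane
    __m128i ji = _mm_srli_epi32(v1, 25);       // 0x0000007F in every lane
    __m128i tmp = _mm_slli_epi32(ji, 23); // I edited this (Added tmp) not sure about it
    __m128 j = _mm_castsi128_ps(tmp); //create vector 1.0f // I edited this not sure about it
    __m128i i = _mm_cvttps_epi32(x);           // truncate toward zero (only valid within int32 range)
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmpgt_ps(fi, x);          // lanes where truncation went up (negative non-integers)
    j = _mm_and_ps(igx, j);                    // 1.0f in those lanes, 0.0f elsewhere
    return _mm_sub_ps(fi, j);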
}
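
The first few statements of FloorSse and CeilSse build the constant 1.0f from an all-ones register instead of loading it from memory, and RoundSse builds "the highest value < 2" the same way. As a rough scalar model of what happens in each 32-bit lane (the function names here are hypothetical, chosen only for this illustration):

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of one lane of the constant construction in FloorSse/CeilSse. */
static float one_from_all_ones(void) {
    uint32_t bits = 0xFFFFFFFFu;  /* _mm_cmpeq_epi32(v0, v0): all bits set  */
    bits >>= 25;                  /* _mm_srli_epi32(v1, 25):  0x0000007F    */
    bits <<= 23;                  /* _mm_slli_epi32(ji, 23):  0x3F800000    */
    float f;
    memcpy(&f, &bits, sizeof f);  /* _mm_castsi128_ps: reinterpret the bits */
    return f;                     /* 0x3F800000 is the encoding of 1.0f     */
}

/* Scalar model of one lane of the constant construction in RoundSse. */
static float near_two_from_all_ones(void) {
    uint32_t bits = 0xFFFFFFFFu;  /* _mm_cmpeq_ps(v0, v0): all bits set     */
    bits >>= 2;                   /* _mm_srli_epi32(tmp, 2): 0x3FFFFFFF     */
    float f;
    memcpy(&f, &bits, sizeof f);  /* largest float strictly below 2.0f      */
    return f;
}
```

In practice `_mm_set1_ps(1.0f)` is a simpler way to get the same vector, and compilers typically turn it into a constant load.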

static inline __m128 CeilSse(const __m128 x) {
    __m128i v0 = _mm_setzero_si128();
    __m128i v1 = _mm_cmpeq_epi32(v0, v0);
    __m128i ji = _mm_srli_epi32(v1, 25);
    __m128i tmp = _mm_slli_epi32(ji, 23); // I edited this (Added tmp) not sure about it
    __m128 j = _mm_castsi128_ps(tmp); //create vector 1.0f // I edited this not sure about it
    __m128i i = _mm_cvttps_epi32(x);
    __m128 fi = _mm_cvtepi32_ps(i);
    __m128 igx = _mm_cmplt_ps(fi, x);          // lanes where truncation went down (positive non-integers)
    j = _mm_and_ps(igx, j);                    // 1.0f in those lanes, 0.0f elsewhere
    return _mm_add_ps(fi, j);
}
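
One caveat with FloorSse and CeilSse as posted: `_mm_cvttps_epi32` only gives a meaningful result when the input fits in a signed 32-bit integer, so very large inputs, infinities, and NaN come back as garbage. A possible workaround, sketched here under the assumption that passing such inputs through unchanged is acceptable, relies on the fact that every float with magnitude at or above 2^23 is already an integer. The names `FloorSseGuarded`, `CeilSseGuarded`, and `IsSmallSse` are hypothetical and not part of the original code.

```c
#include <emmintrin.h>

/* Mask of lanes with |x| < 2^23. The comparison is false for NaN,
   so NaN lanes are treated as "not small" and passed through. */
static inline __m128 IsSmallSse(__m128 x) {
    const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    const __m128 limit = _mm_set1_ps(8388608.0f); /* 2^23 */
    return _mm_cmplt_ps(_mm_and_ps(x, abs_mask), limit);
}

static inline __m128 FloorSseGuarded(__m128 x) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 fi = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));          /* truncate toward zero */
    __m128 adj = _mm_and_ps(_mm_cmpgt_ps(fi, x), one);         /* 1.0f where fi > x    */
    __m128 floored = _mm_sub_ps(fi, adj);
    __m128 small = IsSmallSse(x);
    /* Take the computed value in small lanes, the original x elsewhere. */
    return _mm_or_ps(_mm_and_ps(small, floored), _mm_andnot_ps(small, x));
}

static inline __m128 CeilSseGuarded(__m128 x) {
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 fi = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));          /* truncate toward zero */
    __m128 adj = _mm_and_ps(_mm_cmplt_ps(fi, x), one);         /* 1.0f where fi < x    */
    __m128 ceiled = _mm_add_ps(fi, adj);
    __m128 small = IsSmallSse(x);
    return _mm_or_ps(_mm_and_ps(small, ceiled), _mm_andnot_ps(small, x));
}
```

This sketch still does not preserve the sign of -0.0f, and the out-of-range conversion can still set the invalid-operation flag in MXCSR even though its result is discarded.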

static inline __m128 RoundSse(const __m128 a) {
    // generate the highest value < 2 (bit pattern 0x3FFFFFFF in every lane)
    __m128 v0 = _mm_setzero_ps();
    __m128 v1 = _mm_cmpeq_ps(v0, v0);         // all bits set
    __m128i tmp = _mm_castps_si128(v1); // I edited this (Added tmp) not sure about it
    tmp = _mm_srli_epi32(tmp, 2); // I edited this (Added tmp) not sure about it
    __m128 vNearest2 = _mm_castsi128_ps(tmp); // I edited this (Added tmp) not sure about it
    __m128i i = _mm_cvttps_epi32(a);
    __m128 aTrunc = _mm_cvtepi32_ps(i);        // truncate a
    __m128 rmd = _mm_sub_ps(a, aTrunc);        // get remainder
    __m128 rmd2 = _mm_mul_ps(rmd, vNearest2); // mul remainder by near 2 will yield the needed offset
    __m128i rmd2i = _mm_cvttps_epi32(rmd2);    // after being truncated of course
    __m128 rmd2Trunc = _mm_cvtepi32_ps(rmd2i);
    __m128 r = _mm_add_ps(aTrunc, rmd2Trunc);
    return r;
}
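
Note that RoundSse as written resolves exact ties toward zero (0.5 gives 0, 1.5 gives 1), because 0.5 multiplied by the "just under 2" constant truncates to 0. The SSE4.1 call in the question, `_mm_round_ps` with `_MM_FROUND_TO_NEAREST_INT`, rounds ties to even instead. If that behavior is what is wanted, one alternative sketch is to use `_mm_cvtps_epi32`, which converts using the current MXCSR rounding mode. This assumes MXCSR is still at its default round-to-nearest-even setting and that inputs fit in int32; the name `RoundNearestSse` is hypothetical.

```c
#include <emmintrin.h>

/* Round to nearest integer, ties to even, via float -> int32 -> float.
   Assumes the default MXCSR rounding mode and |x| within int32 range. */
static inline __m128 RoundNearestSse(__m128 x) {
    __m128i i = _mm_cvtps_epi32(x);   /* rounds per MXCSR (nearest-even by default) */
    return _mm_cvtepi32_ps(i);        /* exact conversion back to float             */
}
```

Like the functions above, this loses the sign of results that round to zero and gives meaningless values for inputs outside the int32 range, so the same magnitude guard could be applied if those inputs can occur.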


static inline __m128 ModSee(const __m128 a, const __m128 aDiv) { // static, like the others, to avoid C99 inline linkage errors
    __m128 c = _mm_div_ps(a, aDiv);
    __m128i i = _mm_cvttps_epi32(c);
    __m128 cTrunc = _mm_cvtepi32_ps(i);
    __m128 base = _mm_mul_ps(cTrunc, aDiv);
    __m128 r = _mm_sub_ps(a, base);
    return r;
}
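
For reference, ModSee computes a truncated-division remainder in each lane: a - trunc(a / aDiv) * aDiv. A scalar model of one lane might look like the following; `mod_trunc_scalar` is a hypothetical name, and the cast assumes the quotient fits in an int, which mirrors the range limit of `_mm_cvttps_epi32`.

```c
/* Scalar model of one lane of ModSee.
   Assumes a / a_div fits in int; a larger quotient is undefined here. */
static float mod_trunc_scalar(float a, float a_div) {
    float c = a / a_div;              /* _mm_div_ps                          */
    float c_trunc = (float)(int)c;    /* _mm_cvttps_epi32 + _mm_cvtepi32_ps  */
    float base = c_trunc * a_div;     /* _mm_mul_ps                          */
    return a - base;                  /* _mm_sub_ps                          */
}
```

For example, `mod_trunc_scalar(7.5f, 2.0f)` gives 1.5f and `mod_trunc_scalar(-7.5f, 2.0f)` gives -1.5f, so the result takes the sign of the dividend, as with C's `fmodf`. Because the quotient is rounded before truncation, results may differ from `fmodf` when the quotient is large or very close to an integer.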
