加快浮球的时间? [英] Speedup a short to float cast?

查看:65
本文介绍了加快浮球的时间?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在C ++中有一个简短的浮空转换,这使我的代码成为瓶颈。

I have a short to float cast in C++ that is bottlenecking my code.

代码是从硬件设备缓冲区转换而来的,该缓冲区本来就是短裤,这代表输入

The code translates from a hardware device buffer which is natively shorts, this represents the input from a fancy photon counter.

float factor=  1.0f/value;
for (int i = 0; i < W*H; i++)//25% of time is spent doing this
{
    int value = source[i];//ushort -> int
    destination[i] = value*factor;//int*float->float
}

一些详细信息


  1. 值应从0到2 ^ 16-1,它代表了一个高感光度相机的像素值

  1. Value should go from 0 to 2^16-1, it represents the pixel values of a highly sensitive camera

我在装有i7处理器(i7 960是SSE)的多核x86机器上4.2和4.1)。

I'm on a multicore x86 machine with an i7 processor (i7 960 which is SSE 4.2 and 4.1).

源与8位边界对齐(硬件设备的要求)

Source is aligned to an 8 bit boundary (a requirement of the hardware device)

W * H始终可以被8整除,大多数情况下,W和H可以被8整除

W*H is always divisible by 8, most of the time W and H are divisible by 8

这让我很难过,我能做些什么吗?

This makes me sad, is there anything I can do about it?

我正在使用Visual Studios 2012 ...

I am using Visual Studios 2012...

推荐答案

以下是基本的SSE4.1实现:

Here's a basic SSE4.1 implementation:

__m128 factor = _mm_set1_ps(1.0f / value);
for (int i = 0; i < W*H; i += 8)
{
    //  Load 8 16-bit ushorts.
    //  vi = {a,b,c,d,e,f,g,h}
    __m128i vi = _mm_load_si128((const __m128i*)(source + i));

    //  Convert to 32-bit integers
    //  vi0 = {a,0,b,0,c,0,d,0}
    //  vi1 = {e,0,f,0,g,0,h,0}
    __m128i vi0 = _mm_cvtepu16_epi32(vi);
    __m128i vi1 = _mm_cvtepu16_epi32(_mm_unpackhi_epi64(vi,vi));

    //  Convert to float
    __m128 vf0 = _mm_cvtepi32_ps(vi0);
    __m128 vf1 = _mm_cvtepi32_ps(vi1);

    //  Multiply
    vf0 = _mm_mul_ps(vf0,factor);
    vf1 = _mm_mul_ps(vf1,factor);

    //  Store
    _mm_store_ps(destination + i + 0,vf0);
    _mm_store_ps(destination + i + 4,vf1);
}

这假设:


  1. 目标都对齐为16个字节。

  2. W * H 是8的倍数。

  1. source and destination are both aligned to 16 bytes.
  2. W*H is a multiple of 8.

可以通过进一步展开此循环来做得更好。 (见下文)

It's possible to do better by further unrolling this loop. (see below)

这里的想法如下:


  1. 将8条短裤装入一个SSE寄存器中。

  2. 将寄存器分成两部分:一个具有底部4个短裤,另一个具有顶部4个短裤。

  3. 将两个寄存器零扩展为32位整数。

  4. 将它们都转换为 float s。

  5. 乘以该因子。

  6. 将它们存储在目的地中。

  1. Load 8 shorts into a single SSE register.
  2. Split the register into two: One with the bottom 4 shorts and the other with the top 4 shorts.
  3. Zero-extend both registers into 32-bit integers.
  4. Convert them both to floats.
  5. Multiply by the factor.
  6. Store them into destination.






编辑:

自从我完成这种优化以来已经有一段时间了,所以我继续进行并展开了循环。

It's been a while since I've done this type of optimization, so I went ahead and unrolled the loops.

Core i7 920 @ 3.5 GHz

Visual Studio 2012-版本x64:

Original Loop      : 4.374 seconds
Vectorize no unroll: 1.665
Vectorize unroll 2 : 1.416

进一步展开

这里是TES t代码:

Here's the test code:

#include <smmintrin.h>
#include <time.h>
#include <iostream>
#include <malloc.h>
using namespace std;


void default_loop(float *destination,const short* source,float value,int size){
    float factor = 1.0f / value; 
    for (int i = 0; i < size; i++)
    {
        int value = source[i];
        destination[i] = value*factor;
    }
}
void vectorize8_unroll1(float *destination,const short* source,float value,int size){
    __m128 factor = _mm_set1_ps(1.0f / value);
    for (int i = 0; i < size; i += 8)
    {
        //  Load 8 16-bit ushorts.
        __m128i vi = _mm_load_si128((const __m128i*)(source + i));

        //  Convert to 32-bit integers
        __m128i vi0 = _mm_cvtepu16_epi32(vi);
        __m128i vi1 = _mm_cvtepu16_epi32(_mm_unpackhi_epi64(vi,vi));

        //  Convert to float
        __m128 vf0 = _mm_cvtepi32_ps(vi0);
        __m128 vf1 = _mm_cvtepi32_ps(vi1);

        //  Multiply
        vf0 = _mm_mul_ps(vf0,factor);
        vf1 = _mm_mul_ps(vf1,factor);

        //  Store
        _mm_store_ps(destination + i + 0,vf0);
        _mm_store_ps(destination + i + 4,vf1);
    }
}
void vectorize8_unroll2(float *destination,const short* source,float value,int size){
    __m128 factor = _mm_set1_ps(1.0f / value);
    for (int i = 0; i < size; i += 16)
    {
        __m128i a0 = _mm_load_si128((const __m128i*)(source + i + 0));
        __m128i a1 = _mm_load_si128((const __m128i*)(source + i + 8));

        //  Split into two registers
        __m128i b0 = _mm_unpackhi_epi64(a0,a0);
        __m128i b1 = _mm_unpackhi_epi64(a1,a1);

        //  Convert to 32-bit integers
        a0 = _mm_cvtepu16_epi32(a0);
        b0 = _mm_cvtepu16_epi32(b0);
        a1 = _mm_cvtepu16_epi32(a1);
        b1 = _mm_cvtepu16_epi32(b1);

        //  Convert to float
        __m128 c0 = _mm_cvtepi32_ps(a0);
        __m128 d0 = _mm_cvtepi32_ps(b0);
        __m128 c1 = _mm_cvtepi32_ps(a1);
        __m128 d1 = _mm_cvtepi32_ps(b1);

        //  Multiply
        c0 = _mm_mul_ps(c0,factor);
        d0 = _mm_mul_ps(d0,factor);
        c1 = _mm_mul_ps(c1,factor);
        d1 = _mm_mul_ps(d1,factor);

        //  Store
        _mm_store_ps(destination + i +  0,c0);
        _mm_store_ps(destination + i +  4,d0);
        _mm_store_ps(destination + i +  8,c1);
        _mm_store_ps(destination + i + 12,d1);
    }
}
void print_sum(const float *destination,int size){
    float sum = 0;
    for (int i = 0; i < size; i++){
        sum += destination[i];
    }
    cout << sum << endl;
}

int main(){

    int size = 8000;

    short *source       = (short*)_mm_malloc(size * sizeof(short), 16);
    float *destination  = (float*)_mm_malloc(size * sizeof(float), 16);

    for (int i = 0; i < size; i++){
        source[i] = i;
    }

    float value = 1.1;

    int iterations = 1000000;
    clock_t start;

    //  Default Loop
    start = clock();
    for (int it = 0; it < iterations; it++){
        default_loop(destination,source,value,size);
    }
    cout << (double)(clock() - start) / CLOCKS_PER_SEC << endl;
    print_sum(destination,size);

    //  Vectorize 8, no unroll
    start = clock();
    for (int it = 0; it < iterations; it++){
        vectorize8_unroll1(destination,source,value,size);
    }
    cout << (double)(clock() - start) / CLOCKS_PER_SEC << endl;
    print_sum(destination,size);

    //  Vectorize 8, unroll 2
    start = clock();
    for (int it = 0; it < iterations; it++){
        vectorize8_unroll2(destination,source,value,size);
    }
    cout << (double)(clock() - start) / CLOCKS_PER_SEC << endl;
    print_sum(destination,size);

    _mm_free(source);
    _mm_free(destination);

    system("pause");
}

这篇关于加快浮球的时间?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆