Fast 24-bit array -> 32-bit array conversion?

Question

Quick Summary:

I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements?

Details:

I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixels (either as YUV or RGB images), but DX10 takes 32-bit pixel textures. So, I have to expand the 24-bit values to 32 bits before I can load them into the GPU.

I really don't care what I set the remaining 8 bits to, or where the incoming 24-bits are in that 32-bit value - I can fix all that in a pixel shader. But I need to do the conversion from 24-bit to 32-bit really quickly.

I'm not terribly familiar with SIMD SSE operations, but from my cursory glance it doesn't look like I can do the expansion using them, given my reads and writes aren't the same size. Any suggestions? Or am I stuck sequentially massaging this data set?

This feels so very silly - I'm using the pixel shaders for parallelism, but I have to do a sequential per-pixel operation before that. I must be missing something obvious...

Recommended answer

The code below should be pretty fast. It copies 4 pixels in each iteration, using only 32-bit read/write instructions. The source and destination pointers should be aligned to 32 bits.

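#include <stdint.h>

/* Assumes a little-endian CPU and num_pixels being a multiple of 4. */
uint32_t *src = ...;
uint32_t *dst = ...;

for (int i=0; i<num_pixels; i+=4) {
    /* Three 32-bit words hold four packed 24-bit pixels (12 bytes). */
    uint32_t sa = src[0];
    uint32_t sb = src[1];
    uint32_t sc = src[2];

    /* Each pixel lands in the low 24 bits of its output word. The top
       8 bits of the first three outputs hold stray bytes from the next
       pixel; the question allows that. */
    dst[i+0] = sa;
    dst[i+1] = (sa>>24) | (sb<<8);
    dst[i+2] = (sb>>16) | (sc<<16);
    dst[i+3] = sc>>8;

    src += 3;
}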
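The loop above assumes that the pixel count is a multiple of 4 and that both pointers are 32-bit aligned. As a minimal sketch of how those assumptions could be relaxed, the hypothetical helper `expand24to32` below (the name and signature are not from the original answer) reads the input through `memcpy`, clears the unused top byte so the output is deterministic, and finishes any leftover pixels one byte at a time. It still assumes a little-endian host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: expands num_pixels packed 3-byte pixels into
   4-byte pixels with the top byte cleared. Assumes a little-endian host. */
static void expand24to32(const uint8_t *src, uint32_t *dst, size_t num_pixels)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;

    /* Main loop: 12 input bytes become 4 output pixels. memcpy keeps the
       32-bit reads valid even if src is not 4-byte aligned. */
    for (; i + 4 <= num_pixels; i += 4) {
        uint32_t sa, sb, sc;
        memcpy(&sa, src + 0, 4);
        memcpy(&sb, src + 4, 4);
        memcpy(&sc, src + 8, 4);

        dst[i + 0] = sa & 0x00FFFFFFu;
        dst[i + 1] = ((sa >> 24) | (sb << 8)) & 0x00FFFFFFu;
        dst[i + 2] = ((sb >> 16) | (sc << 16)) & 0x00FFFFFFu;
        dst[i + 3] = sc >> 8;

        src += 12;
    }

    /* Tail: the remaining 0 to 3 pixels, assembled byte by byte. */
    for (; i < num_pixels; ++i) {
        dst[i] = (uint32_t)src[0]
               | ((uint32_t)src[1] << 8)
               | ((uint32_t)src[2] << 16);
        src += 3;
    }
}
```

Masking costs three extra AND operations per iteration; if the pixel shader ignores the top byte anyway, the masks can be dropped to match the original loop.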

Edit:

Here is a way to do this using the SSSE3 instructions PSHUFB and PALIGNR. The code is written using compiler intrinsics, but it shouldn't be hard to translate to assembly if needed. It copies 16 pixels in each iteration. The source and destination pointers must be aligned to 16 bytes, or it will fault. If they aren't aligned, you can make it work by replacing _mm_load_si128 with _mm_loadu_si128 and _mm_store_si128 with _mm_storeu_si128, but this will be slower.

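#include <emmintrin.h>
#include <tmmintrin.h>

/* num_pixels must be a multiple of 16. */
__m128i *src = ...;
__m128i *dst = ...;
/* PSHUFB control: spread 12 packed bytes into four 4-byte pixels;
   an index of -1 writes a zero byte. */
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);

for (int i=0; i<num_pixels; i+=16) {
    /* 48 input bytes = 16 packed pixels */
    __m128i sa = _mm_load_si128(src);
    __m128i sb = _mm_load_si128(src+1);
    __m128i sc = _mm_load_si128(src+2);

    /* pixels 0-3: input bytes 0-11 */
    __m128i val = _mm_shuffle_epi8(sa, mask);
    _mm_store_si128(dst, val);
    /* pixels 4-7: input bytes 12-23, straddling sa and sb */
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
    _mm_store_si128(dst+1, val);
    /* pixels 8-11: input bytes 24-35, straddling sb and sc */
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
    _mm_store_si128(dst+2, val);
    /* pixels 12-15: input bytes 36-47, all in sc */
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
    _mm_store_si128(dst+3, val);

    src += 3;
    dst += 4;
}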
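For frames whose pixel count is not a multiple of 16, or whose buffers are not 16-byte aligned, the loop above needs a wrapper. The following is only a sketch under stated assumptions: the names `expand_blocks_ssse3`, `expand24to32_simd`, and `EXPAND_USE_SSSE3` are made up for illustration, the `target("ssse3")` function attribute assumes GCC or Clang on x86, and the CPU is assumed to support SSSE3. It uses the unaligned load/store variants mentioned above and finishes leftover pixels with a plain byte loop. On other compilers or architectures it falls back to the byte loop entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8, _mm_alignr_epi8 */
#define EXPAND_USE_SSSE3 1
#else
#define EXPAND_USE_SSSE3 0
#endif

#if EXPAND_USE_SSSE3
/* Converts as many whole 16-pixel blocks as fit and returns the number
   of pixels written. Assumes the CPU supports SSSE3. */
__attribute__((target("ssse3")))
static size_t expand_blocks_ssse3(const uint8_t *src, uint8_t *dst,
                                  size_t num_pixels)
{
    const __m128i mask =
        _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);
    size_t done = 0;

    for (; done + 16 <= num_pixels; done += 16) {
        __m128i sa = _mm_loadu_si128((const __m128i *)(src));
        __m128i sb = _mm_loadu_si128((const __m128i *)(src + 16));
        __m128i sc = _mm_loadu_si128((const __m128i *)(src + 32));

        _mm_storeu_si128((__m128i *)(dst),
                         _mm_shuffle_epi8(sa, mask));
        _mm_storeu_si128((__m128i *)(dst + 16),
                         _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask));
        _mm_storeu_si128((__m128i *)(dst + 32),
                         _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask));
        _mm_storeu_si128((__m128i *)(dst + 48),
                         _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask));

        src += 48;  /* 16 pixels * 3 bytes */
        dst += 64;  /* 16 pixels * 4 bytes */
    }
    return done;
}
#endif

/* Expands num_pixels 3-byte pixels into 4-byte pixels; the fourth byte
   of every output pixel is zero. */
static void expand24to32_simd(const uint8_t *src, uint8_t *dst,
                              size_t num_pixels)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
#if EXPAND_USE_SSSE3
    i = expand_blocks_ssse3(src, dst, num_pixels);
#endif
    /* Scalar tail (or the whole frame when SSSE3 is not compiled in). */
    for (; i < num_pixels; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0;
    }
}
```

Working on byte pointers rather than `__m128i *` keeps the tail arithmetic simple, and each SIMD iteration reads exactly 48 bytes, so the loads never run past the end of the input buffer.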

SSSE3 (not to be confused with SSE3) requires a relatively recent processor: Core 2 or newer on the Intel side. When this answer was written AMD did not support it yet; AMD processors gained SSSE3 later, starting with the Bobcat and Bulldozer families. Performing this with SSE2 instructions only will take a lot more operations, and may not be worth it.
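Because SSSE3 is not guaranteed on every x86 processor, a program can check for it at run time and fall back to the scalar loop otherwise. The sketch below assumes GCC or Clang on x86, where the builtin `__builtin_cpu_supports` is available; `cpu_has_ssse3`, `kernel_t`, and `choose_kernel` are hypothetical names introduced here for illustration.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 if the running CPU reports SSSE3,
   otherwise 0. Uses a GCC/Clang builtin that exists on x86 only. */
static int cpu_has_ssse3(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("ssse3") ? 1 : 0;
#else
    return 0;  /* unknown compiler or architecture: use the scalar path */
#endif
}

typedef enum { KERNEL_SCALAR, KERNEL_SSSE3 } kernel_t;

/* Picks which conversion loop to run for the current CPU. */
static kernel_t choose_kernel(void)
{
    int has = cpu_has_ssse3();
    assert(has == 0 || has == 1);
    return has ? KERNEL_SSSE3 : KERNEL_SCALAR;
}
```

Calling `choose_kernel()` once at startup and caching the result avoids repeating the check for every frame.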