Fast 24-bit array -> 32-bit array conversion?

Question

Quick summary:

I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements?

Details:

I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixels (either as YUV or RGB images), but DX10 takes 32-bit pixel textures. So, I have to expand the 24-bit values to 32 bits before I can load them into the GPU.

I really don't care what I set the remaining 8 bits to, or where the incoming 24 bits are in that 32-bit value - I can fix all that in a pixel shader. But I need to do the conversion from 24-bit to 32-bit really quickly.

I'm not terribly familiar with SIMD SSE operations, but from my cursory glance it doesn't look like I can do the expansion using them, given my reads and writes aren't the same size. Any suggestions? Or am I stuck sequentially massaging this data set?

This feels so very silly - I'm using the pixel shaders for parallelism, but I have to do a sequential per-pixel operation before that. I must be missing something obvious...

Recommended answer

The code below should be pretty fast. It copies 4 pixels in each iteration, using only 32-bit read/write instructions. The source and destination pointers should be aligned to 32 bits, the shifts assume little-endian byte order (as on x86), and num_pixels is assumed to be a multiple of 4.

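#include <stdint.h>

uint32_t *src = ...;  /* packed 24-bit pixels, read 32 bits at a time */
uint32_t *dst = ...;  /* one 32-bit word per pixel */

/* The top byte of each output word is not cleared: it holds
   leftover bits from the following pixel. */
for (int i=0; i<num_pixels; i+=4) {
    /* three 32-bit words = 12 bytes = four 24-bit pixels */
    uint32_t sa = src[0];
    uint32_t sb = src[1];
    uint32_t sc = src[2];

    dst[i+0] = sa;                    /* source bytes 0-2  */
    dst[i+1] = (sa>>24) | (sb<<8);    /* source bytes 3-5  */
    dst[i+2] = (sb>>16) | (sc<<16);   /* source bytes 6-8  */
    dst[i+3] = sc>>8;                 /* source bytes 9-11 */

    src += 3;
}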
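For reference, the same idea can be wrapped in a function that also copes with pixel counts that are not a multiple of 4. This is only a sketch under some assumptions that are not in the answer above: the name `expand24to32_scalar` is hypothetical, the source is taken as a byte pointer and read through `memcpy` so that alignment does not matter, and the unused top byte of each output pixel is masked to zero. The main loop still assumes a little-endian machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original answer): expands packed
 * 24-bit pixels into 32-bit pixels with the top byte set to zero.
 * The 4-pixel main loop assumes little-endian byte order. */
void expand24to32_scalar(const uint8_t *src, uint32_t *dst, size_t num_pixels)
{
    assert(src != NULL && dst != NULL);

    size_t i = 0;
    /* Main loop: 4 pixels (12 source bytes) per iteration. */
    for (; i + 4 <= num_pixels; i += 4) {
        uint32_t sa, sb, sc;
        /* memcpy sidesteps alignment and aliasing issues; compilers
         * usually turn each call into one 32-bit load. */
        memcpy(&sa, src + 0, 4);
        memcpy(&sb, src + 4, 4);
        memcpy(&sc, src + 8, 4);

        dst[i + 0] = sa & 0x00FFFFFFu;
        dst[i + 1] = ((sa >> 24) | (sb << 8)) & 0x00FFFFFFu;
        dst[i + 2] = ((sb >> 16) | (sc << 16)) & 0x00FFFFFFu;
        dst[i + 3] = sc >> 8;

        src += 12;
    }
    /* Tail: the remaining 0-3 pixels, assembled one byte at a time. */
    for (; i < num_pixels; ++i) {
        dst[i] = (uint32_t)src[0]
               | ((uint32_t)src[1] << 8)
               | ((uint32_t)src[2] << 16);
        src += 3;
    }
}
```

Masking costs three extra AND operations per iteration; if the shader ignores the top byte anyway, the masks can be dropped.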

Here is a way to do this using the SSSE3 instructions PSHUFB and PALIGNR. The code is written using compiler intrinsics, but it shouldn't be hard to translate to assembly if needed. It copies 16 pixels in each iteration, so num_pixels should be a multiple of 16. The source and destination pointers must be aligned to 16 bytes, or it will fault. If they aren't aligned, you can make it work by replacing _mm_load_si128 with _mm_loadu_si128 and _mm_store_si128 with _mm_storeu_si128, but this will be slower.

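#include <emmintrin.h>
#include <tmmintrin.h>

__m128i *src = ...;
__m128i *dst = ...;
/* PSHUFB control: take 12 packed bytes and spread them into four
   32-bit lanes; an index of -1 writes a zero byte. */
__m128i mask = _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);

for (int i=0; i<num_pixels; i+=16) {
    /* 48 source bytes = 16 pixels */
    __m128i sa = _mm_load_si128(src);
    __m128i sb = _mm_load_si128(src+1);
    __m128i sc = _mm_load_si128(src+2);

    /* pixels 0-3: source bytes 0-11 */
    __m128i val = _mm_shuffle_epi8(sa, mask);
    _mm_store_si128(dst, val);
    /* pixels 4-7: source bytes 12-23, straddling sa and sb */
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask);
    _mm_store_si128(dst+1, val);
    /* pixels 8-11: source bytes 24-35, straddling sb and sc */
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask);
    _mm_store_si128(dst+2, val);
    /* pixels 12-15: source bytes 36-47, all within sc */
    val = _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask);
    _mm_store_si128(dst+3, val);

    src += 3;
    dst += 4;
}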
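The unaligned variant mentioned above might look like the following. Treat it as a sketch under stated assumptions rather than a tuned implementation: the function name `expand24to32_ssse3` is hypothetical, the byte-wise tail loop for the last 0-15 pixels is an addition, and the `target("ssse3")` attribute is a GCC/Clang feature used here so the function compiles without passing `-mssse3` for the whole file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>
#include <tmmintrin.h>

/* Hypothetical helper (not from the original answer): SSSE3 expansion
 * using unaligned loads and stores, plus a byte-wise tail loop. */
#if defined(__GNUC__)
__attribute__((target("ssse3")))
#endif
void expand24to32_ssse3(const uint8_t *src, uint32_t *dst, size_t num_pixels)
{
    assert(src != NULL && dst != NULL);

    /* Same shuffle control as above; -1 produces a zero byte. */
    const __m128i mask =
        _mm_setr_epi8(0,1,2,-1, 3,4,5,-1, 6,7,8,-1, 9,10,11,-1);

    size_t i = 0;
    /* Main loop: 16 pixels (48 source bytes) per iteration. */
    for (; i + 16 <= num_pixels; i += 16) {
        __m128i sa = _mm_loadu_si128((const __m128i *)(src + 0));
        __m128i sb = _mm_loadu_si128((const __m128i *)(src + 16));
        __m128i sc = _mm_loadu_si128((const __m128i *)(src + 32));

        _mm_storeu_si128((__m128i *)(dst + i + 0),
                         _mm_shuffle_epi8(sa, mask));
        _mm_storeu_si128((__m128i *)(dst + i + 4),
                         _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask));
        _mm_storeu_si128((__m128i *)(dst + i + 8),
                         _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask));
        _mm_storeu_si128((__m128i *)(dst + i + 12),
                         _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask));

        src += 48;
    }
    /* Tail: the remaining 0-15 pixels, assembled one byte at a time. */
    for (; i < num_pixels; ++i) {
        dst[i] = (uint32_t)src[0]
               | ((uint32_t)src[1] << 8)
               | ((uint32_t)src[2] << 16);
        src += 3;
    }
}
```

Because the main loop reads exactly 48 bytes per 16 pixels, it never reads past the end of the source buffer.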

SSSE3 (not to be confused with SSE3) will require a relatively new processor: Core 2 or newer, and at the time of writing I believe AMD doesn't support it yet. Performing this with SSE2 instructions only will take a lot more operations, and may not be worth it.
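Since SSSE3 is not available on every processor, one option is to test for it at run time and fall back to the 32-bit version when it is missing. A minimal sketch, assuming GCC or Clang on x86: `__builtin_cpu_init` and `__builtin_cpu_supports` are builtins of those compilers, and `cpu_has_ssse3` is a hypothetical helper name that does not appear in the answer above.

```c
/* Hypothetical helper: returns 1 when the running CPU reports SSSE3
 * support and 0 otherwise. On compilers or targets without the
 * builtin it conservatively returns 0, so the caller takes the
 * portable 32-bit path. */
int cpu_has_ssse3(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* harmless if the CPU data is already set up */
    return __builtin_cpu_supports("ssse3") ? 1 : 0;
#else
    return 0;
#endif
}
```

A caller would run this check once at startup and then pick either the SSSE3 loop or the 32-bit loop for all subsequent frames.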
