SSE: Difference between _mm_load/store vs. using direct pointer access
Question
Suppose I want to add two buffers and store the result. Both buffers are already allocated with 16-byte alignment. I found two examples of how to do that.
The first one uses _mm_load to read the data from the buffers into SSE registers, performs the add, and stores the result back to memory. Until now I would have done it like that.
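#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>
#include <stdint.h>

// Assumes dst and src are 16-byte aligned and n is a multiple of 8.
void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
    for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
    {
        __m128i _s = _mm_load_si128( (__m128i const*) src );  // aligned load of 8 x uint16_t
        __m128i _d = _mm_load_si128( (__m128i const*) dst );
        _d = _mm_add_epi16( _d, _s );                         // lane-wise 16-bit add (wraps on overflow)
        _mm_store_si128( (__m128i*) dst, _d );                // aligned store
    }
}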
The second example just does the add directly on the memory addresses, without explicit load/store operations. Both seem to work fine.
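// Same assumptions: 16-byte aligned pointers, n a multiple of 8.
void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
    for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
    {
        // Dereferencing __m128i pointers; the compiler emits the loads and the store.
        *(__m128i*) dst = _mm_add_epi16( *(__m128i*) dst, *(__m128i const*) src );
    }
}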
So the question is whether the second example is correct or may have side effects, and when using load/store is mandatory.
Thanks.
Recommended answer
Both versions are fine. If you look at the generated code you will see that the second version still generates at least one load into a vector register, since PADDW (aka _mm_add_epi16) can take only its second operand directly from memory; the first operand must already be in a register.
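To make the equivalence concrete, here is a minimal sketch that puts the two variants side by side and compares their output on identical input. The names add_intrinsics, add_deref, and both_variants_agree are hypothetical and chosen only for this illustration; the sketch assumes an x86 target with SSE2, 16-byte aligned buffers, and a length that is a multiple of 8. Note that dereferencing a __m128i pointer carries the same alignment requirement as _mm_load_si128/_mm_store_si128, so neither variant is suitable for unaligned data.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>
#include <cstring>

// Variant 1: explicit aligned load/store intrinsics.
void add_intrinsics(uint16_t* dst, const uint16_t* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m128i s = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i d = _mm_load_si128(reinterpret_cast<const __m128i*>(dst + i));
        _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), _mm_add_epi16(d, s));
    }
}

// Variant 2: dereference __m128i pointers and let the compiler pick the moves.
void add_deref(uint16_t* dst, const uint16_t* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m128i* d = reinterpret_cast<__m128i*>(dst + i);
        const __m128i* s = reinterpret_cast<const __m128i*>(src + i);
        *d = _mm_add_epi16(*d, *s);
    }
}

// Hypothetical helper: runs both variants on identical input and compares the results.
bool both_variants_agree() {
    alignas(16) uint16_t src[16], a[16], b[16];
    for (int i = 0; i < 16; ++i) {
        src[i] = static_cast<uint16_t>(1000 * i);
        a[i] = b[i] = static_cast<uint16_t>(65000 + i);  // large values so some lanes wrap around
    }
    add_intrinsics(a, src, 16);
    add_deref(b, src, 16);
    return std::memcmp(a, b, sizeof a) == 0;
}
```

Compiling both functions with optimizations and inspecting the assembly is a quick way to check what your own compiler emits for each variant.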
In practice, most non-trivial SIMD code does many more operations between loading and storing data than a single add, so in general you probably want to load data into vector variables (registers) using _mm_load_XXX, perform all your SIMD operations on registers, then store the results back to memory via _mm_store_XXX.
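As a rough illustration of that load, compute, store pattern, the sketch below performs two operations per vector between a single load and a single store. The function add_sat_halve and the helper add_sat_halve_one are hypothetical names invented for this example, and the operation itself (saturating add followed by halving) is arbitrary; the sketch assumes SSE2, 16-byte aligned pointers, and a length that is a multiple of 8.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical example: dst[i] = saturating_add(dst[i], src[i]) / 2.
// Assumes both pointers are 16-byte aligned and n is a multiple of 8.
void add_sat_halve(uint16_t* dst, const uint16_t* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        // Load each input once into vector variables (registers).
        __m128i s = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i d = _mm_load_si128(reinterpret_cast<const __m128i*>(dst + i));

        // Intermediate results stay in vector variables.
        d = _mm_adds_epu16(d, s);  // unsigned saturating add, clamps at 65535
        d = _mm_srli_epi16(d, 1);  // logical shift right by 1, i.e. divide by 2

        // Store the final result once.
        _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), d);
    }
}

// Hypothetical helper: applies add_sat_halve to one vector filled with (d, s)
// and returns the first lane of the result.
uint16_t add_sat_halve_one(uint16_t d, uint16_t s) {
    alignas(16) uint16_t dst[8], src[8];
    for (int i = 0; i < 8; ++i) {
        dst[i] = d;
        src[i] = s;
    }
    add_sat_halve(dst, src, 8);
    return dst[0];
}
```

Writing the intermediate steps against named vector variables keeps the data flow explicit and leaves register allocation to the compiler.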