warning C4799: function has no EMMS instruction


Problem description

I'm trying to create a C# app that uses a DLL containing C++ code with inline assembly. In the function test_MMX, I want to add two arrays of a given length, element by element.

extern "C" __declspec(dllexport) void __stdcall test_MMX(int *first_array,int *second_array,int length)
{
    __asm
    {
         mov ecx,length;
         mov esi,first_array;
         shr ecx,1;
         mov edi,second_array;
     label:
         movq mm0,QWORD PTR[esi];
         paddd mm0,QWORD PTR[edi];
         add edi,8;
         movq QWORD PTR[esi],mm0;
         add esi,8;
         dec ecx;
         jnz label;
     }
}

Building it produces this warning:

warning C4799: function 'test_MMX' has no EMMS instruction.

When I measure the running time of this function from C# in milliseconds, I get -922337203685477 instead of something like 0.0141:

private Stopwatch time = new Stopwatch();
time.Reset();
time.Start();
test_MMX(first_array, second_array, length);
time.Stop();
TimeSpan interval = time.Elapsed;
return interval.TotalMilliseconds;

Any ideas on how to fix this?

Solution

Since the MMX registers alias the x87 floating-point registers, any routine that uses MMX instructions must end with the EMMS instruction. This instruction "clears" the registers, making them available for use by the x87 FPU once again, which is the state that every C or C++ calling convention for x86 assumes on function entry and return.

The compiler is warning you that you have written a routine that uses MMX instructions but does not end with the EMMS instruction. That's a bug waiting to happen, as soon as some FPU instruction tries to execute.

This is a huge disadvantage of MMX, and the reason why you really can't freely intermix MMX and floating-point instructions. Sure, you could just throw EMMS instructions around, but EMMS is a slow, high-latency instruction, so doing that kills performance. SSE had the same limitation as MMX in this regard, at least for integer operations, because its integer instructions still worked on the MMX registers. SSE2 was the first instruction set to address this problem: its integer operations use the separate XMM register set. Those registers are also twice as wide as MMX's, so you can do even more at a time.

Since SSE2 does everything that MMX does, but faster, more easily, and more efficiently, and is supported by the Pentium 4 and later, it is quite rare that anyone needs to write new code today that uses MMX. If you can use SSE2, you should. Another reason not to use MMX is that MSVC does not support it when targeting x64: neither inline assembly nor the MMX (__m64) intrinsics are available there.

Anyway, the correct way to write the MMX code would be:

__asm
{
     mov   ecx, [length]
     mov   eax, [first_array]
     shr   ecx, 1
     mov   edx, [second_array]
 label:
     movq  mm0, QWORD PTR [eax]
     paddd mm0, QWORD PTR [edx]
     add   edx, 8
     movq  QWORD PTR [eax], mm0
     add   eax, 8
     dec   ecx
     jnz   label
     emms
 }

Note that, in addition to the EMMS instruction (which, of course, is placed outside of the loop), I made a few additional changes:

  • Assembly-language instructions do not end with semicolons. In fact, in assembly language's syntax, the semicolon is used to begin a comment. So I have removed your semicolons.
  • I've also added spaces for readability.
  • And, while it isn't strictly necessary (Microsoft's inline assembler is sufficiently forgiving so as to allow you to get away with not doing it), it is a good idea to be explicit and wrap the use of addresses (C/C++ variables) in square brackets, since you are actually dereferencing them.
  • As a commenter pointed out, you can freely use the ESI and EDI registers in inline assembly, since the inline assembler will detect their use and generate additional instructions that push/pop them accordingly. In fact, it will do this with all non-volatile registers. And if you need additional registers, then you need them, and this is a nice feature. But in this code, you're only using three general-purpose registers, and in the __stdcall calling convention, there are three general-purpose registers that are specifically defined as volatile (i.e., can be freely clobbered by any function): EAX, EDX, and ECX. So you should be using those registers for maximum speed. As such, I've changed your use of ESI to EAX, and your use of EDI to EDX. This will improve the code that you can't see, the prologue and epilogue automatically generated by the compiler.

You have a potential speed trap lurking here, though, and that is alignment. To obtain maximum speed, MMX instructions need to operate on data that is aligned on 8-byte boundaries. In a loop, misaligned data has a compounding negative effect on performance: not only is the data misaligned the first time through the loop, exerting a significant performance penalty, but it is guaranteed to be misaligned each subsequent time through the loop, too. So for this code to have any chance of being fast, the caller needs to guarantee that first_array and second_array are aligned on 8-byte boundaries.

If you can't guarantee that, then the function should really have extra code added to it to fix up misalignments. Essentially, you want to do a couple of non-vector operations (on individual elements) at the beginning, before starting the loop, until you've reached a suitable alignment. Then, you can start issuing the vectorized MMX instructions.

(Unaligned loads are no longer penalized on modern processors, but if you were targeting modern processors, you'd be writing SSE2 code. On the older processors where you need to run MMX code, alignment will be a big deal, and misaligned data will kill your performance.)
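The fix-up described above can be sketched in plain C++. This is only an illustration, not code from the question: the function name add_arrays_aligned_peel is hypothetical, and the paired loop in the middle merely stands in for the MOVQ/PADDD/MOVQ loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the original code): adds second[i] into
// first[i], peeling scalar iterations until `first + i` is 8-byte aligned.
void add_arrays_aligned_peel(int *first, const int *second, std::size_t length)
{
    std::size_t i = 0;

    // Scalar prologue: one int at a time until we sit on an 8-byte boundary.
    while (i < length &&
           reinterpret_cast<std::uintptr_t>(first + i) % 8 != 0)
    {
        first[i] += second[i];
        ++i;
    }
    assert(i == length ||
           reinterpret_cast<std::uintptr_t>(first + i) % 8 == 0);

    // Main loop: two ints (8 bytes) per iteration. This is where the
    // vector instructions would go; plain C++ stands in for them here.
    for (; i + 2 <= length; i += 2)
    {
        first[i]     += second[i];
        first[i + 1] += second[i + 1];
    }

    // Scalar epilogue: at most one leftover element.
    for (; i < length; ++i)
    {
        first[i] += second[i];
    }
}
```

Note that this only aligns first. If the two arrays sit at different offsets relative to an 8-byte boundary, second stays misaligned inside the main loop, so it is still best for the caller to align both.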

Now, this inline assembly won't produce particularly efficient code. When you use inline assembly, the compiler always generates prologue and epilogue code for the function. That isn't terrible, since it's outside of the critical inner loop, but still—it's cruft you don't need. Worse, jumps in inline assembly blocks tend to confuse MSVC's inline assembler and cause it to generate sub-optimal code. It is overly cautious, preventing you from doing something that could corrupt the stack or cause other external side effects, which is nice, except that the whole reason you're writing inline assembly is (presumably) because you desire maximum performance.

(It should go without saying, but if you don't need the maximum possible performance, you should just write the code in C (or C++) and let the compiler optimize it. It does a darn good job in the majority of cases.)

If you do need the maximum possible performance, and have decided that the compiler-generated code just won't cut it, then a better alternative to inline assembly is the use of intrinsics. Intrinsics generally map one-to-one to assembly-language instructions, but the compiler does a much better job of optimizing around them.

Here's my version of your code, using MMX intrinsics:

#include <intrin.h>   // include header with MMX intrinsics


void __stdcall Function_With_Intrinsics(int *first_array, int *second_array, int length)
{
   unsigned int counter = static_cast<unsigned int>(length);
   counter /= 2;
   do
   {
      *reinterpret_cast<__m64*>(first_array) = _mm_add_pi32(*reinterpret_cast<const __m64*>(first_array),
                                                            *reinterpret_cast<const __m64*>(second_array));
      first_array  += 2;   // two ints = 8 bytes = one __m64
      second_array += 2;
   } while (--counter != 0);
   _mm_empty();
}

It does the same thing, but more efficiently by delegating more to the compiler's optimizer. A couple of notes:

  1. Since your assembly code treats length as an unsigned integer, I assume that your interface requires that it actually be an unsigned integer. (And, if so, I wonder why you don't declare it as such in the function's signature.) To achieve the same effect, I've cast it to an unsigned int, which is subsequently used as the counter. (If I hadn't done that, I'd have had to do either a right shift on a signed integer, whose result is implementation-defined for negative values before C++20, or a division by two, for which the compiler would have generated slower code to correctly deal with the sign bit.)
  2. The *reinterpret_cast<__m64*> business scattered throughout looks scary, but is actually safe, at least relatively speaking. That's what you're supposed to do with the MMX intrinsics. The MMX data type is __m64, which you can think of as being roughly equivalent to an mm? register. It is 64 bits in length, and loads and stores are accomplished by casting. These get translated directly into MOVQ instructions.
  3. Your original assembly code was written such that the loop always iterated at least once, so I transformed that into a do-while loop. This means the test of the loop condition only has to be done at the bottom of the loop, rather than once at the top and once at the bottom. (Like the original, it therefore must not be called with a length below 2.)
  4. The _mm_empty() intrinsic causes an EMMS instruction to be emitted.

Just for grins, let's see what the compiler transformed this into. This is the output from MSVC 16 (VS 2010), targeting x86-32 and optimizing for speed over size (though it makes no difference in this particular case):

PUBLIC  ?Function_With_Intrinsics@@YGXPAH0H@Z
; Function compile flags: /Ogtpy
_first_array$  = 8              ; size = 4
_second_array$ = 12             ; size = 4
_length$       = 16             ; size = 4
?Function_With_Intrinsics@@YGXPAH0H@Z PROC
    mov    ecx, DWORD PTR _length$[esp-4]
    mov    edx, DWORD PTR _second_array$[esp-4]
    mov    eax, DWORD PTR _first_array$[esp-4]
    shr    ecx, 1
    sub    edx, eax
$LL3:
    movq   mm0, MMWORD PTR [eax]
    movq   mm1, MMWORD PTR [edx+eax]
    paddd  mm0, mm1
    movq   MMWORD PTR [eax], mm0
    add    eax, 8
    dec    ecx
    jne    SHORT $LL3
    emms
    ret    12
?Function_With_Intrinsics@@YGXPAH0H@Z ENDP

It is recognizably similar to your original code, but does a couple of things differently. In particular, it tracks the array pointers differently, in a way that it (and I) believe is slightly more efficient than your original code, since it does less work inside of the loop. It also breaks apart your PADDD instruction so that both of its operands are MMX registers, instead of the source being a memory operand. Again, this tends to make the code more efficient at the expense of clobbering an additional MMX register, but we've got plenty of those to spare, so it's certainly worth it.

Better yet, as the optimizer improves in newer versions of the compiler, code that is written using intrinsics may get even better!

Of course, rewriting the function to use intrinsics doesn't solve the alignment problem, but I'm assuming you have already dealt with that on the caller side. If not, you'll need to add code to handle it.
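For a native C++ caller, handling alignment on the caller side could look roughly like the sketch below. Both is_aligned and AlignedBlock are hypothetical names introduced only for illustration; a C# caller would need its own way of obtaining suitably aligned memory before passing pointers across the DLL boundary.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `p` sits on an `alignment`-byte boundary.
// `alignment` must be a power of two.
inline bool is_aligned(const void *p, std::size_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// A caller-owned buffer whose start is guaranteed to be 16-byte aligned,
// which covers both the 8-byte (MMX) and the 16-byte (SSE2) case.
struct AlignedBlock
{
    alignas(16) int values[64];
};
```

A debug build of the vector routine could then check is_aligned(first_array, 8) on entry and fall back to a scalar loop when the check fails.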

If you wanted to use SSE2—perhaps that would be test_SSE2 and you would dynamically delegate to the appropriate implementation depending on the current processor's feature bits—then you could do it like this:

#include <intrin.h>   // include header with SSE2 intrinsics


void __stdcall Function_With_Intrinsics_SSE2(int *first_array, int *second_array, int length)
{
   unsigned int counter = static_cast<unsigned>(length);
   counter /= 4;
   do
   {
      _mm_storeu_si128(reinterpret_cast<__m128i*>(first_array),
                       _mm_add_epi32(_mm_loadu_si128(reinterpret_cast<const __m128i*>(first_array)),
                                     _mm_loadu_si128(reinterpret_cast<const __m128i*>(second_array))));
      first_array  += 4;   // four ints = 16 bytes = one __m128i
      second_array += 4;
   } while (--counter != 0);
}

I've written this code without assuming alignment, so it will work when the loads and stores are misaligned. For maximum speed on many older architectures, SSE2 wants 16-byte alignment, and if you can guarantee that the source and destination pointers are so aligned, you can use slightly faster instructions (e.g., MOVDQA as opposed to MOVDQU). As mentioned above, on newer architectures (at least Sandy Bridge and later, perhaps earlier), it doesn't matter.

To give you an idea of how SSE2 is basically just a drop-in replacement for MMX on Pentium 4 and later, except that you also get to do operations that are twice as wide, look at the code this compiles to:

PUBLIC  ?Function_With_Intrinsics_SSE2@@YGXPAH0H@Z
; Function compile flags: /Ogtpy
_first_array$  = 8              ; size = 4
_second_array$ = 12             ; size = 4
_length$       = 16             ; size = 4
?Function_With_Intrinsics_SSE2@@YGXPAH0H@Z PROC
    mov     ecx, DWORD PTR _length$[esp-4]
    mov     edx, DWORD PTR _second_array$[esp-4]
    mov     eax, DWORD PTR _first_array$[esp-4]
    shr     ecx, 2
    sub     edx, eax
$LL3:
    movdqu  xmm0, XMMWORD PTR [eax]
    movdqu  xmm1, XMMWORD PTR [edx+eax]
    paddd   xmm0, xmm1
    movdqu  XMMWORD PTR [eax], xmm0
    add     eax, 16
    dec     ecx
    jne     SHORT $LL3
    ret     12
?Function_With_Intrinsics_SSE2@@YGXPAH0H@Z ENDP
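Delegating to an implementation based on the processor's feature bits, as mentioned above, might be sketched like this. All names here (AddFn, add_scalar, add_sse2, cpu_has_sse2, select_add_impl, HAVE_SSE2_PATH) are hypothetical. The sketch assumes MSVC's __cpuid or the GCC/Clang builtin __builtin_cpu_supports for detection, and it falls back to a scalar loop when the SSE2 path is not compiled in. Unlike the versions above, the SSE2 routine also handles lengths that are not a multiple of four.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#  include <intrin.h>      // __cpuid
#  include <emmintrin.h>   // SSE2 intrinsics
#  define HAVE_SSE2_PATH 1
#elif defined(__SSE2__)
#  include <emmintrin.h>   // SSE2 intrinsics
#  define HAVE_SSE2_PATH 1
#else
#  define HAVE_SSE2_PATH 0
#endif

using AddFn = void (*)(int *, const int *, std::size_t);

// Portable fallback: one element per iteration.
void add_scalar(int *first, const int *second, std::size_t length)
{
    assert(length == 0 || (first != nullptr && second != nullptr));
    for (std::size_t i = 0; i < length; ++i)
        first[i] += second[i];
}

#if HAVE_SSE2_PATH
// Four ints per iteration with unaligned loads/stores, then a scalar tail.
void add_sse2(int *first, const int *second, std::size_t length)
{
    std::size_t i = 0;
    for (; i + 4 <= length; i += 4)
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(first + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(second + i));
        _mm_storeu_si128(reinterpret_cast<__m128i *>(first + i), _mm_add_epi32(a, b));
    }
    add_scalar(first + i, second + i, length - i);   // 0 to 3 leftover elements
}

// Runtime check of the CPU's SSE2 feature bit.
bool cpu_has_sse2()
{
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);                      // leaf 1: feature information
    return (info[3] & (1 << 26)) != 0;     // EDX bit 26 = SSE2
#else
    return __builtin_cpu_supports("sse2") != 0;
#endif
}
#endif

// Picks an implementation once; callers keep the returned pointer.
AddFn select_add_impl()
{
#if HAVE_SSE2_PATH
    if (cpu_has_sse2())
        return add_sse2;
#endif
    return add_scalar;
}
```

A caller would typically run select_add_impl() once at start-up, store the result, and call through the stored pointer afterwards. An MMX variant ending in _mm_empty() could be slotted in as a third candidate for 32-bit builds in the same way.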

As for the final question about getting negative values from the .NET Stopwatch class, I would normally guess that this is due to an overflow. In other words, your code ran for so long that the timer wrapped around. Kevin Gosse pointed out, though, that this is apparently a bug in the implementation of the Stopwatch class. I don't know much more about it, since I don't really use it. Either way, fix the missing EMMS first and measure again, since without it the x87 FPU is left in an unusable state for any floating-point code that runs after test_MMX returns. If you want a good microbenchmarking library, I use and recommend Google Benchmark. However, it is for C++, not C#.
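One detail about the reported number is easy to check. A .NET TimeSpan stores its value as a signed 64-bit count of 100-nanosecond ticks, so one millisecond is 10,000 ticks. The arithmetic below (the names are mine, purely for illustration) shows that -922337203685477 is what the smallest possible tick count looks like in whole milliseconds. That suggests the elapsed-time computation produced an out-of-range result, rather than the timer having run for a very long time, though it does not by itself prove where the bad value came from.

```cpp
#include <cstdint>
#include <limits>

// One millisecond expressed in 100 ns ticks.
constexpr std::int64_t kTicksPerMillisecond = 10000;

// Whole milliseconds represented by the smallest 64-bit tick count.
// Integer division truncates toward zero, dropping the fractional part.
constexpr std::int64_t min_ticks_in_milliseconds()
{
    return std::numeric_limits<std::int64_t>::min() / kTicksPerMillisecond;
}

static_assert(min_ticks_in_milliseconds() == -922337203685477LL,
              "matches the value reported in the question");
```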

While you're benchmarking, definitely take the time to time the code generated by the compiler when you write it the naïve way. Say, something like:

void Naive_PackedAdd(int *first_array, int *second_array, int length)
{
   for (unsigned int i = 0; i < static_cast<unsigned int>(length); ++i)
   {
      first_array[i] += second_array[i];
   }
}

You just might be pleasantly surprised at how fast the code is after the compiler finishes auto-vectorizing the loop. :-) Remember that less code does not necessarily mean faster code. The compiler's output for this loop is much longer than the hand-written versions, and all of that extra code is required to deal with alignment issues, which I've diplomatically skirted throughout this answer. In that output, at the label $LL4@Naive_Pack, you'll find an inner loop very similar to what we've been considering here.
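If you just want a rough number without pulling in a benchmarking library, a minimal timing sketch in standard C++ might look like the following. The helper time_naive_add_ms is hypothetical, and taking the best of several runs is my own simplification; a real microbenchmark library handles warm-up, statistics, and optimizer interference far more carefully.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

void Naive_PackedAdd(int *first_array, int *second_array, int length)
{
    for (unsigned int i = 0; i < static_cast<unsigned int>(length); ++i)
    {
        first_array[i] += second_array[i];
    }
}

// Hypothetical helper: runs Naive_PackedAdd `repeats` times on fresh-ish data
// and returns the best wall-clock time in milliseconds (-1.0 if repeats <= 0).
double time_naive_add_ms(std::size_t length, int repeats)
{
    std::vector<int> first(length, 1), second(length, 2);
    double best = -1.0;

    for (int r = 0; r < repeats; ++r)
    {
        auto start = std::chrono::steady_clock::now();
        Naive_PackedAdd(first.data(), second.data(), static_cast<int>(length));
        auto stop = std::chrono::steady_clock::now();

        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best)
            best = ms;
    }

    // Read a result back so the additions stay observable to the optimizer.
    if (!first.empty())
    {
        volatile int sink = first.front();
        (void)sink;
    }
    return best;
}
```

std::chrono::steady_clock is monotonic, so the measured interval cannot come out negative. The same harness can time the intrinsics versions by swapping the call inside the loop.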
