与上证所内部函数性能 [英] performance of intrinsic functions with sse

查看：123 发布时间：2016/8/22 14:34:08 c sse

本文介绍了与上证所内部函数性能的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我目前入门证。
这个问题的答案我的previous问题有关SSE（通过不断使用SSE < Mutiplying矢量/一>）给我带来的想法，以测试使用内部函数如 _mm_mul_ps（）键，只用正常运营的区别（不知道最好的词是什么）像 * 。

I am currently getting started with SSE. The answer to my previous question regarding SSE ( Mutiplying vector by constant using SSE ) brought me to the idea to test the difference between using intrinsics like _mm_mul_ps()and just using 'normal operators' (not sure what the best term is) like *.

所以我写了两个测试的情况下，只有在方式上不同结果的计算：结果
方法1：

So i wrote two testing cases which only differ in way the result is calculated:
Method 1:

int main(void){
    float4 a, b, c;

    a.v = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    b.v = _mm_set_ps(-1.0f, -2.0f, -3.0f, -4.0f);

    printf("method 1\n");
    c.v = a.v + b.v;      // <---
    print_vector(a);
    print_vector(b);
    printf("1.a) Computed output 1: ");
    print_vector(c);

    exit(EXIT_SUCCESS);
}

方法2：

int main(void){
    float4 a, b, c;

    a.v = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    b.v = _mm_set_ps(-1.0f, -2.0f, -3.0f, -4.0f);

    printf("\nmethod 2\n");
    c.v = _mm_add_ps(a.v, b.v);      // <---
    print_vector(a);
    print_vector(b);
    printf("1.b) Computed output 2: ");
    print_vector(c);

    exit(EXIT_SUCCESS);
}

这两种测试情况分享以下内容：

both testing cases share the following:

typedef union float4{
    __m128  v;
    float   x,y,z,w;
} float4;

void print_vector (float4 v){
    printf("%f,%f,%f,%f\n", v.x, v.y, v.z, v.w);
}

所以，比较两种情况产生的code I使用编译：结果
gcc的-ggdb -msse -c t_vectorExtensions_method1.c

这导致（仅示出其中两个矢量相加哪位不同的部分）：结果，
方法1：

Which resulted in (showing only the part where the two vectors are added -which differs):
Method 1:

    c.v = a.v + b.v;
  a1:   0f 57 c9                xorps  %xmm1,%xmm1
  a4:   0f 12 4d d0             movlps -0x30(%rbp),%xmm1
  a8:   0f 16 4d d8             movhps -0x28(%rbp),%xmm1
  ac:   0f 57 c0                xorps  %xmm0,%xmm0
  af:   0f 12 45 c0             movlps -0x40(%rbp),%xmm0
  b3:   0f 16 45 c8             movhps -0x38(%rbp),%xmm0
  b7:   0f 58 c1                addps  %xmm1,%xmm0
  ba:   0f 13 45 b0             movlps %xmm0,-0x50(%rbp)
  be:   0f 17 45 b8             movhps %xmm0,-0x48(%rbp)

方法2：

    c.v = _mm_add_ps(a.v, b.v);
  a1:   0f 57 c0                xorps  %xmm0,%xmm0
  a4:   0f 12 45 a0             movlps -0x60(%rbp),%xmm0
  a8:   0f 16 45 a8             movhps -0x58(%rbp),%xmm0
  ac:   0f 57 c9                xorps  %xmm1,%xmm1
  af:   0f 12 4d b0             movlps -0x50(%rbp),%xmm1
  b3:   0f 16 4d b8             movhps -0x48(%rbp),%xmm1
  b7:   0f 13 4d f0             movlps %xmm1,-0x10(%rbp)
  bb:   0f 17 4d f8             movhps %xmm1,-0x8(%rbp)
  bf:   0f 13 45 e0             movlps %xmm0,-0x20(%rbp)
  c3:   0f 17 45 e8             movhps %xmm0,-0x18(%rbp)
/* Perform the respective operation on the four SPFP values in A and B.  */

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
  c7:   0f 57 c0                xorps  %xmm0,%xmm0
  ca:   0f 12 45 e0             movlps -0x20(%rbp),%xmm0
  ce:   0f 16 45 e8             movhps -0x18(%rbp),%xmm0
  d2:   0f 57 c9                xorps  %xmm1,%xmm1
  d5:   0f 12 4d f0             movlps -0x10(%rbp),%xmm1
  d9:   0f 16 4d f8             movhps -0x8(%rbp),%xmm1
  dd:   0f 58 c1                addps  %xmm1,%xmm0
  e0:   0f 13 45 90             movlps %xmm0,-0x70(%rbp)
  e4:   0f 17 45 98             movhps %xmm0,-0x68(%rbp)

使用内部 _mm_add_ps时，显然产生code（）大得多。为什么是这样？难道不应该得到更好的code？

Obviously the code generated when using the intrinsic _mm_add_ps() is much larger. Why is this? Shouldn't it result in better code?

与上证所内部函数性能 [英] performance of intrinsic functions with sse

问题描述

推荐答案

相关文章

C/C++最新文章

热门教程

热门工具

登录关闭

与上证所内部函数性能 [英] performance of intrinsic functions with sse

问题描述

推荐答案

相关文章

C/C++最新文章

热门教程

热门工具

登录 关闭

登录关闭