Why is one thread faster than just calling a function (MinGW)?


Problem description


    When I call the function directly, the execution time is 6.8 sec. When I call it from a thread, it takes 3.4 sec, and with 2 threads it takes 1.8 sec. No matter what optimization I use, the ratios stay the same.

    In Visual Studio the times are as expected: 3.1, 3 and 1.7 sec.

    #include<math.h>
    #include<stdio.h>
    #include<windows.h>
    #include <time.h>
    
    using namespace std;
    
    #define N 400
    
    float a[N][N];
    
    struct b{
        int begin;
        int end;
    };
    
    DWORD WINAPI thread(LPVOID p)
    {
        b b_t = *(b*)p;
    
        for(int i=0;i<N;i++)
            for(int j=b_t.begin;j<b_t.end;j++)
            {
                a[i][j] = 0;
                for(int k=0;k<i;k++)
                    a[i][j]+=k*sin(j)-j*cos(k);
            }
    
        return (0);
    }
    
    int main()
    {
        clock_t t;
        HANDLE hn[2];
    
        b b_t[3];
    
        b_t[0].begin = 0;
        b_t[0].end = N;
    
        b_t[1].begin = 0;
        b_t[1].end = N/2;
    
        b_t[2].begin = N/2;
        b_t[2].end = N;
    
        t = clock();
        thread(&b_t[0]);
        printf("0 - %ld\n",(long)(clock()-t));   // clock_t is not int; cast to match the format
    
        t = clock();
        hn[0] = CreateThread ( NULL, 0, thread,  &b_t[0], 0, NULL);
        WaitForSingleObject(hn[0], INFINITE );
        printf("1 - %ld\n",(long)(clock()-t));
    
        t = clock();
        hn[0] = CreateThread ( NULL, 0, thread,  &b_t[1], 0, NULL);
        hn[1] = CreateThread ( NULL, 0, thread,  &b_t[2], 0, NULL);
        WaitForMultipleObjects(2, hn, TRUE, INFINITE );
        printf("2 - %ld\n",(long)(clock()-t));
    
        return 0;
    }
    

    Times:

    0 - 6868
    1 - 3362
    2 - 1827
    

    CPU - Core 2 Duo T9300

    OS - Windows 8, 64-bit

    compiler: mingw32-g++.exe, gcc version 4.6.2

    Edit:

    I tried a different order and got the same result; I even tried separate applications. Task Manager shows CPU utilization of around 50% for the plain function call and for 1 thread, and 100% for 2 threads.

    Sum of all elements after each call is the same: 3189909.237955

    Cygwin result: 2.5, 2.5 and 2.5 sec
    Linux result (pthread): 3.7, 3.7 and 2.1 sec

    @borisbn results: 0 - 1446 1 - 1439 2 - 721.

    Solution

    The difference is a result of something in the math library implementing sin() and cos(): if you replace the calls to those functions with something else that takes time, the significant difference between step 0 and step 1 goes away.
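
One way to run the control experiment described above is to keep the loop nest from thread() but pass the two inner functions in as parameters, so the same kernel can be timed once with sin()/cos() and once with a stand-in that never enters the math library. The sketch below is in C; `unary_fn`, `kernel` and `poly_work` are hypothetical names that do not appear in the original code, and `poly_work` is an arbitrary choice of filler arithmetic. Unlike the original, the kernel returns a sum instead of filling the global array, which makes results easy to compare.

```c
#include <assert.h>

/* Signature shared by sin(), cos() and the stand-in below. */
typedef double (*unary_fn)(double);

/* Hypothetical stand-in for sin()/cos(): a few multiply-adds,
   with no call into the math library. */
static double poly_work(double x)
{
    double acc = x;
    for (int r = 0; r < 8; r++)
        acc = acc * 0.5 + 1.0;
    return acc;
}

/* Same loop nest as thread() in the question, but with the two
   inner functions injected. Returns the sum of all computed cells. */
static double kernel(int n, int begin, int end, unary_fn f, unary_fn g)
{
    assert(n > 0 && begin >= 0 && begin <= end);

    double total = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = begin; j < end; j++) {
            float cell = 0;
            for (int k = 0; k < i; k++)
                cell += k * f(j) - j * g(k);
            total += cell;
        }
    return total;
}
```

Calling `kernel(N, 0, N, sin, cos)` (with `<math.h>` included) reproduces the original arithmetic, while `kernel(N, 0, N, poly_work, poly_work)` gives the control run. Timing both with clock(), on the main thread and on a created thread, shows whether the gap follows the math-library calls.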

    Note that I see the difference with gcc (tdm-1) 4.6.1, which is a 32-bit toolchain targeting 32 bit binaries. Optimization makes no difference (not surprising since it seems to be something in the math library).

    However, if I build using gcc (tdm64-1) 4.6.1, which is a 64-bit toolchain, the difference does not appear, regardless of whether the build creates a 32-bit program (using the -m32 option) or a 64-bit program (-m64).

    Here are some example test runs (I made minor modifications to the source to make it C99 compatible):

    • Using the 32-bit TDM MinGW 4.6.1 compiler:

      C:\temp>gcc --version
      gcc (tdm-1) 4.6.1
      
      C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
      
      C:\temp>test
      0 - 4082
      1 - 2439
      2 - 1238
      

    • Using the 64-bit TDM 4.6.1 compiler:

      C:\temp>gcc --version
      gcc (tdm64-1) 4.6.1
      
      C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
      
      C:\temp>test
      0 - 2506
      1 - 2476
      2 - 1254
      
      C:\temp>gcc -m64 -std=gnu99 -o test.exe test.c
      
      C:\temp>test
      0 - 3031
      1 - 3031
      2 - 1539
      

    A little more information:

    The 32-bit TDM distribution (gcc (tdm-1) 4.6.1) links to the sin()/cos() implementations in the msvcrt.dll system DLL via a provided import library:

    c:/mingw32/bin/../lib/gcc/mingw32/4.6.1/../../../libmsvcrt.a(dcfls00599.o)
                    0x004a113c                _imp__cos
    

    While the 64-bit distribution (gcc (tdm64-1) 4.6.1) doesn't appear to do that, instead linking to some static library implementation provided with the distribution:

    c:/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/4.6.1/../../../../x86_64-w64-mingw32/lib/../lib32/libmingwex.a(lib32_libmingwex_a-cos.o)
                                  C:\Users\mikeb\AppData\Local\Temp\cc3pk20i.o (cos)
    


    Update/Conclusion:

    After a bit of spelunking in a debugger, stepping through the assembly of msvcrt.dll's implementation of cos(), I've found that the difference in timing between the main thread and an explicitly created thread is due to the FPU's precision being set to a non-default setting (presumably the MinGW runtime in question does this at startup). In the situation where the thread() function takes twice as long, the FPU is set to 64-bit precision (REAL10, or in MSVC-speak _PC_64). When the FPU control word is something other than 0x27f (the default state?), the msvcrt.dll runtime performs the following steps in the sin() and cos() functions (and probably other floating-point functions):

    • save the current FPU control word
    • set the FPU control word to 0x27f (I believe it's possible for this value to be modified)
    • perform the fsin/fcos operation
    • restore the saved FPU control word

    The save/restore of the FPU control word is skipped if it's already set to the expected/desired 0x27f value. Apparently saving/restoring the FPU control word is expensive, since it appears to double the amount of time the function takes.
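
The guard described above can be modelled in portable C without touching the real FPU. The sketch below decodes the precision-control field of an x87 control word (bits 8 and 9) and counts how many control-word writes one call would need. The value 0x27f and the save/set/restore sequence come from the answer; 0x37f is the x87 reset value and is only assumed here to be what the MinGW startup code leaves behind. All function and macro names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CW_EXPECTED 0x027f  /* value msvcrt.dll was seen to compare against */
#define CW_X87_INIT 0x037f  /* x87 reset value; assumed MinGW startup state */

/* Decode the precision-control field (bits 8-9) into significand bits.
   Returns 0 for the reserved encoding 01b. */
static int precision_bits(uint16_t cw)
{
    switch ((cw >> 8) & 0x3) {
    case 0:  return 24;  /* single precision */
    case 2:  return 53;  /* double precision */
    case 3:  return 64;  /* extended precision (REAL10) */
    default: return 0;   /* reserved */
    }
}

/* Model of the guard: how many control-word writes one sin()/cos()
   call performs for a given current control word. */
static int control_word_writes(uint16_t current_cw)
{
    if (current_cw == CW_EXPECTED)
        return 0;        /* fast path: save/restore is skipped */
    return 2;            /* set to 0x27f, then restore the saved value */
}

/* Sanity check of the two values discussed in the answer. */
static void check_known_values(void)
{
    assert(precision_bits(CW_EXPECTED) == 53);
    assert(precision_bits(CW_X87_INIT) == 64);
}
```

With 0x27f the model reports zero writes and 53-bit precision; with 0x37f it reports two writes per call and 64-bit precision, which corresponds to the slow main-thread case.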

    You can solve the problem by adding the following line to main() before calling thread():

    _control87( _PC_53, _MCW_PC);   // requires <float.h>
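
For context, here is a slightly fuller sketch of where that call could go. It wraps the call in a helper (hypothetical name `use_53bit_precision`) that only does anything on 32-bit x86 Windows builds, since that is where the answer observed the slowdown, and it reads the control word back with `_control87(0, 0)` to confirm the change. The preprocessor guard is an assumption about how one might keep the file portable, not something from the original answer, and it presumes a non-strict mode such as -std=gnu99 so that the declarations in `<float.h>` are visible.

```c
#include <assert.h>
#include <float.h>   /* on Windows toolchains: _control87, _PC_53, _MCW_PC */

/* Switch the x87 FPU to 53-bit precision so that msvcrt.dll's sin()/cos()
   find the control word they expect. Returns 1 if the setting was applied,
   0 if this build is not 32-bit x86 Windows and nothing was done. */
static int use_53bit_precision(void)
{
#if defined(_WIN32) && (defined(_M_IX86) || defined(__i386__))
    _control87(_PC_53, _MCW_PC);
    /* Passing a zero mask reads the current control word without changing it. */
    assert((_control87(0, 0) & _MCW_PC) == _PC_53);
    return 1;
#else
    return 0;
#endif
}
```

Call `use_53bit_precision()` once at the top of main(), before the first timed call to thread(). Keep in mind that 53-bit precision also limits `long double` arithmetic done on the x87 in that thread.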
    
