Why is one thread faster than just calling a function, mingw
Problem description
When I call the function directly, execution time is 6.8 sec. Calling it from a thread takes 3.4 sec, and using 2 threads takes 1.8 sec. No matter what optimization I use, the ratios stay the same.
In Visual Studio the times are as expected: 3.1, 3 and 1.7 sec.
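#include <math.h>
#include <stdio.h>
#include <windows.h>
#include <time.h>
using namespace std;
#define N 400
float a[N][N];
// Half-open column range [begin, end) processed by one call.
struct b {
    int begin;
    int end;
};
// Fills columns [begin, end) of a; used as a plain function and as a thread entry point.
DWORD WINAPI thread(LPVOID p)
{
    b b_t = *(b*)p;
    for (int i = 0; i < N; i++)
        for (int j = b_t.begin; j < b_t.end; j++)
        {
            a[i][j] = 0;
            for (int k = 0; k < i; k++)
                a[i][j] += k*sin(j) - j*cos(k);
        }
    return (0);
}
int main()
{
    clock_t t;
    HANDLE hn[2];
    b b_t[3];
    b_t[0].begin = 0;      // whole matrix
    b_t[0].end = N;
    b_t[1].begin = 0;      // first half of the columns
    b_t[1].end = N/2;
    b_t[2].begin = N/2;    // second half of the columns
    b_t[2].end = N;
    // Step 0: direct call on the main thread.
    t = clock();
    thread(&b_t[0]);
    printf("0 - %ld\n", (long)(clock() - t));   // clock_t is long here, so %d was a mismatch
    // Step 1: the same work on one explicitly created thread.
    t = clock();
    hn[0] = CreateThread(NULL, 0, thread, &b_t[0], 0, NULL);
    WaitForSingleObject(hn[0], INFINITE);
    printf("1 - %ld\n", (long)(clock() - t));
    // Step 2: the work split across two threads.
    t = clock();
    hn[0] = CreateThread(NULL, 0, thread, &b_t[1], 0, NULL);
    hn[1] = CreateThread(NULL, 0, thread, &b_t[2], 0, NULL);
    WaitForMultipleObjects(2, hn, TRUE, INFINITE);
    printf("2 - %ld\n", (long)(clock() - t));
    return 0;
}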
Times:
0 - 6868
1 - 3362
2 - 1827
CPU - Core 2 Duo T9300
OS - Windows 8, 64-bit
Compiler - mingw32-g++.exe, gcc version 4.6.2
Edit:
Tried a different order with the same result; even tried separate applications. Task Manager shows CPU utilization around 50% for the function call and for 1 thread, and 100% for 2 threads.
The sum of all elements after each call is the same: 3189909.237955
Cygwin result: 2.5, 2.5 and 2.5 sec
Linux result (pthread): 3.7, 3.7 and 2.1 sec
@borisbn results: 0 - 1446 1 - 1439 2 - 721.
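For anyone who wants to repeat the three measurements outside Windows, here is a rough portable sketch of the same pattern (direct call, one thread, two threads). It is not the asker's program: it assumes C++11, swaps CreateThread for std::thread and clock() for std::chrono::steady_clock, and shrinks N to 64 so it finishes quickly. The names fill_columns, timed_run_ms, checksum and compare_runs are made up for illustration. With GCC on Linux you may need to build with -pthread.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <thread>
#include <vector>

constexpr int N = 64;  // assumption: far smaller than the question's 400
static float a[N][N];

// Same arithmetic as the question's thread() function, for columns [begin, end).
void fill_columns(int begin, int end) {
    for (int i = 0; i < N; ++i)
        for (int j = begin; j < end; ++j) {
            a[i][j] = 0;
            for (int k = 0; k < i; ++k)
                a[i][j] += k * std::sin(j) - j * std::cos(k);
        }
}

// Runs the kernel directly (workers == 0) or split across `workers` threads
// and returns the elapsed wall-clock time in milliseconds.
long long timed_run_ms(int workers) {
    const auto start = std::chrono::steady_clock::now();
    if (workers == 0) {
        fill_columns(0, N);
    } else {
        std::vector<std::thread> pool;
        for (int t = 0; t < workers; ++t)
            pool.emplace_back(fill_columns, t * N / workers, (t + 1) * N / workers);
        for (std::thread& th : pool) th.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

// Sum of all elements, the same sanity check the question mentions.
double checksum() {
    double sum = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += a[i][j];
    return sum;
}

// Checks that a direct call and a two-thread run give (nearly) the same sum.
// A relative tolerance is used because FPU precision may differ between threads.
void compare_runs() {
    timed_run_ms(0);
    const double direct = checksum();
    timed_run_ms(2);
    assert(std::fabs(checksum() - direct) <= 1e-6 * std::fabs(direct));
}
```

Printing timed_run_ms(0), timed_run_ms(1) and timed_run_ms(2) gives the three figures. steady_clock measures wall-clock time; clock() reports elapsed wall-clock time in the Microsoft C runtime but processor time on POSIX systems, so clock() figures from different platforms are not directly comparable.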
Answer
The difference is a result of something in the math library implementing sin() and cos(): if you replace the calls to those functions with something else that takes time, the significant difference between step 0 and step 1 goes away.
Note that I see the difference with gcc (tdm-1) 4.6.1, which is a 32-bit toolchain targeting 32-bit binaries. Optimization makes no difference (not surprising, since it seems to be something in the math library).
However, if I build using gcc (tdm64-1) 4.6.1, which is a 64-bit toolchain, the difference does not appear, regardless of whether the build creates a 32-bit program (using the -m32 option) or a 64-bit program (-m64).
Here are some example test runs (I made minor modifications to the source to make it C99 compatible):
Using the 32-bit TDM MinGW 4.6.1 compiler:
C:\temp>gcc --version
gcc (tdm-1) 4.6.1
C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
C:\temp>test
0 - 4082
1 - 2439
2 - 1238
Using the 64-bit TDM 4.6.1 compiler:
C:\temp>gcc --version
gcc (tdm64-1) 4.6.1
C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
C:\temp>test
0 - 2506
1 - 2476
2 - 1254
C:\temp>gcc -m64 -std=gnu99 -o test.exe test.c
C:\temp>test
0 - 3031
1 - 3031
2 - 1539
A little more information:
The 32-bit TDM distribution (gcc (tdm-1) 4.6.1) links to the sin()/cos() implementations in the msvcrt.dll system DLL via a provided import library:
c:/mingw32/bin/../lib/gcc/mingw32/4.6.1/../../../libmsvcrt.a(dcfls00599.o)
0x004a113c _imp__cos
While the 64-bit distribution (gcc (tdm64-1) 4.6.1) doesn't appear to do that, instead linking to some static library implementation provided with the distribution:
c:/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/4.6.1/../../../../x86_64-w64-mingw32/lib/../lib32/libmingwex.a(lib32_libmingwex_a-cos.o)
C:\Users\mikeb\AppData\Local\Temp\cc3pk20i.o (cos)
Update/Conclusion:
After a bit of spelunking in a debugger, stepping through the assembly of msvcrt.dll's implementation of cos(), I've found that the difference in timing between the main thread and an explicitly created thread is due to the FPU's precision being set to a non-default setting (presumably the MinGW runtime in question does this at startup). In the situation where the thread() function takes twice as long, the FPU is set to 64-bit precision (REAL10, or in MSVC-speak _PC_64). When the FPU control word is something other than 0x27f (the default state?), the msvcrt.dll runtime performs the following steps in the sin() and cos() functions (and probably other floating point functions):
- save the current FPU control word
- set the FPU control word to 0x27f (I believe it's possible for this value to be modified)
- perform the fsin/fcos operation
- restore the saved FPU control word
The save/restore of the FPU control word is skipped if it's already set to the expected/desired 0x27f value. Apparently saving/restoring the FPU control word is expensive, since it appears to double the amount of time the function takes.
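To make the control-word values concrete, here is a small sketch that decodes the precision-control field of an x87 control word. The bit layout (exception masks in bits 0-5, precision control in bits 8-9) comes from the Intel architecture manuals; the function names are made up for illustration. The value 0x37f does not come from the debugging session above: it is what the FINIT instruction loads, and it is included only for comparison with 0x27f.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Precision-control (PC) field: bits 8-9 of the x87 control word.
constexpr unsigned precision_control(std::uint16_t cw) {
    return (cw >> 8) & 0x3u;
}

// Significand width selected by the PC field; 0 for the reserved encoding (01).
constexpr int significand_bits(std::uint16_t cw) {
    return precision_control(cw) == 0 ? 24    // single precision
         : precision_control(cw) == 2 ? 53    // double precision (_PC_53)
         : precision_control(cw) == 3 ? 64    // extended precision (_PC_64)
         : 0;
}

// True when all six floating-point exceptions (bits 0-5) are masked.
constexpr bool all_exceptions_masked(std::uint16_t cw) {
    return (cw & 0x3fu) == 0x3fu;
}

// Human-readable summary of the precision setting.
std::string describe(std::uint16_t cw) {
    const int bits = significand_bits(cw);
    return bits == 0 ? std::string("reserved precision encoding")
                     : std::to_string(bits) + "-bit significand";
}

// The two values discussed here, checked against the decoder.
void check_examples() {
    assert(significand_bits(0x27f) == 53);  // the value msvcrt.dll expects
    assert(significand_bits(0x37f) == 64);  // FINIT value, extended precision
}
```

Read this way, 0x27f is 53-bit precision with every exception masked, so a thread running at 64-bit precision has a different control word and takes the slower save/set/restore path described above.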
You can solve the problem by adding the following line to main() before calling thread():
_control87( _PC_53, _MCW_PC); // requires <float.h>
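In context, the fix might be wrapped as in the sketch below. The helper name use_53bit_x87_precision and the HAVE_X87_PRECISION_CONTROL macro are hypothetical; only _control87, _PC_53 and _MCW_PC come from the answer. The preprocessor guard is my own assumption: precision control is an x87 setting, so the call is compiled only for 32-bit x86 Windows builds and the helper does nothing elsewhere.

```cpp
#include <cassert>
#include <float.h>  // declares _control87, _PC_53 and _MCW_PC on MinGW and MSVC

// Assumption: only 32-bit x86 Windows targets need (and support) this call.
#if defined(_WIN32) && (defined(__i386__) || defined(_M_IX86))
#define HAVE_X87_PRECISION_CONTROL 1
#else
#define HAVE_X87_PRECISION_CONTROL 0
#endif

// Hypothetical helper: switches the calling thread's x87 precision to 53 bits.
// Returns true when the call was made, false when it was compiled out.
bool use_53bit_x87_precision() {
#if HAVE_X87_PRECISION_CONTROL
    _control87(_PC_53, _MCW_PC);  // change only the precision-control bits
    // _control87(0, 0) reads the current setting without modifying it.
    assert((_control87(0, 0) & _MCW_PC) == _PC_53);
    return true;
#else
    return false;
#endif
}
```

Calling use_53bit_x87_precision() as the first statement of main() covers the direct thread(&b_t[0]) call. The control word is per-thread state, so the change affects only the thread that makes the call; in the timings above, the threads started with CreateThread were already running at the fast setting.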