How can I reduce the effect of pthread_join? (MinGW32, C++)

Problem description

I wrote a C++ function that is part of a larger project and is called very frequently. To improve performance, we decided to split it into four parts, running two at a time in parallel. The complete program takes a single input, runs a simulation, and passes a variable of length 2000 to the function in question.

The function operates on that variable (20,096 max operations, 150,000 additions, and no multiplications). These operations are split between func1 and func2, which run in parallel and are each called twice, so each call performs a quarter of the work. Both functions share the same input in memory (a read-only double array Signal of size 700, and double arrays A, B, C and H, each of size 5600, which are both read and written) and the same output (a double array L of size 700).

No mutexes are necessary: func1 reads and writes one half of A, B, C and H and writes its half of L, while func2 does the same with the other half. Both threads sometimes read Signal at the same time, but concurrent reads with no writer are safe. On the second call, the threads perform almost the same operations.

The problem is that the threaded program runs slightly slower than the serial one. When I time each func on its own, it takes a quarter of the original function's total time, which makes sense because func1 and func2 are each called twice. I use clock() (returning clock_t) for timing; on Windows it measures wall-clock time rather than processor time as the standard specifies, but its results agreed with other timers such as the Windows QueryPerformanceCounter.

I timed everything and tried every suggestion I found. I used the optimization options -O3, -O2 and -Ofast. I also gave each thread its own copy of the memory (even of the read-only arrays) and copied the results back afterwards.

I have two theories:
1. The overhead of pthreads takes as much time as the functions themselves.
2. main() sleeps while waiting in pthread_join.

I lean towards theory 2, because the only place where time is lost is somewhere inside pthread_join.

I wrote the sample code below to reproduce the problem. Note that the loop structure is essential to the algorithm I am implementing, so rearranging the operations to use fewer loops is not an option.

Note that if you increase the amount of data per call (the j<10000 and j<5000 bounds) and reduce the count range correspondingly, the threaded program starts to perform better.

This serial version runs in 1.3 seconds:

    #include <iostream>
    #include <time.h>
    using namespace std;
    
    int main(){
        int i,m,j,k;
    
        clock_t time_time;
        time_time=clock();
    
        // 50000 repetitions of a 10000-iteration loop, all on the main thread.
        // i and k are never read afterwards, so an optimizing compiler
        // may remove these loops entirely.
        for (int count =0 ; count<50000;++count){
            for (j=0;j<10000;j++){
                m=j;
                k=j+1;
                i=m*j;
            }
        }
        cout<<"time spent = "<< double(clock()-time_time)/CLOCKS_PER_SEC<<endl;
    }
    

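This threaded version runs in 5 seconds on the same processor, even though it performs the same total number of loop iterations (50000 × 2 × 5000 = 50000 × 10000):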

    #include <pthread.h>
    #include <iostream>
    #include <time.h>
    
    using namespace std;
    
    void test (int i);
    
    // Thread entry point: reads its index (unused here) and does one
    // 5000-iteration chunk of work.
    void *thread_func(void *arg){
        unsigned int idxThread = *((unsigned int *) arg);
        (void) idxThread;
        test (1);
        return NULL;
    }
    
    void test (int i){
        int j,k,m;
        for (j=0;j<5000;j++){
            m=j;
            k=j+1;
            i=m*j;
        }
    }
    
    int main(){
        const unsigned int numThreads=2;  // const, so the arrays below are not VLAs
    
        clock_t time_time;
        pthread_t threads[numThreads];
        unsigned int threadIDs[numThreads];
        time_time =clock();
    
        // Every iteration creates two threads and joins both, so thread
        // creation and joining is paid 100000 times in total.
        for (int count =0 ; count<50000;++count){
            for (unsigned int id = 0; id < numThreads; ++id)
            {
                threadIDs[id]=id;
                pthread_create(&(threads[id]), NULL, thread_func, (void *) &(threadIDs[id]));
            }
            for (unsigned int id = 0; id < numThreads; ++id)
            {
                pthread_join(threads[id], NULL);
            }
        }
        cout<<"time spent = "<< double(clock()-time_time)/CLOCKS_PER_SEC<<endl;
    }
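A rough estimate from the sample's own numbers, assuming (as an idealization) that the two threads would otherwise split the serial 1.3 s of work perfectly, compares the useful work per thread with the remaining time per thread created and joined:

$$
t_{\text{work per thread}} \approx \frac{1.3\,\text{s}}{50000 \times 2} = 13\,\mu\text{s},
\qquad
t_{\text{create+join}} \approx \frac{5\,\text{s} - 1.3\,\text{s}/2}{50000 \times 2} = \frac{4.35\,\text{s}}{100000} \approx 43.5\,\mu\text{s}
$$

Under that assumption each thread in the sample spends more than three times as long being created and joined as doing useful work, which points to per-thread overhead (theory 1) rather than to pthread_join itself.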
    

EDIT: The 50000 iterations exist only to illustrate the problem. In my real code there are just two calls each to func1 and func2, i.e. four thread creations and joins, which seem to take about 2 milliseconds.

OS: Windows, MinGW32, pthreads, C++
CPU: i7, RAM: 8 GB

Makefile:
    CC = g++ -O3 -I............ -Wformat -c 
    LINK = g++ -Wl,--stack,8388608 -o
    LINKFLAGS = -lpthread
    

Solution

1. Don't create and join threads. Keep a pool of threads running and assign them tasks as needed.

2. Don't wait for tasks to complete unless you have no choice. Instead, have the completion of each task trigger whatever work comes next, so that no thread sits waiting.
