Increasing efficiency of the following code using threads


Question



I'm using a machine with 8 cores and 32 GB of RAM. On this machine I'm running C++ code, built with VS2010 on Windows x64, that takes 3 days to complete 8 trees (8 is the number of outer threads). I searched for the bottleneck and found that the crossCorrelate method takes around 75-80% of the time. Now I'm trying to make that method more efficient; the code is as follows:

int main(){
    int numThread = 8;
    //create threads, run build_tree method for each of them
    //and join after running all of them
}

// I'm creating 8 tree

void build_tree(int i){  //called millions of times 
    for(some_value to another_val){
        //do some stuff
        read_corresponding_matrices
        crossCorrelate(mat1,mat2);
    }
    //write the results to a file 
}

//each tree is working with its own data, no dependency between trees.

Mat crossCorrelate(Mat mat1_real, Mat mat2_real){
    Mat mat1, mat2, result;
    clock_t s1 = clock();   // timing start (missing in the original snippet)

    //1st multi-threading part  // around 20 ms
    Scalar mean1 = mean(mat1_real);
    subtract(mat1_real,(float)mean1[0],mat1);

    Scalar mean2 = mean(mat2_real);
    subtract(mat2_real,(float)mean2[0],mat2);
    //1st part ends

    Mat tilted_mat2 = flip_cross(mat2);

    Mat planes[] = {Mat_<float>(mat1), Mat::zeros(mat1.size(), CV_32F)};
    Mat planes2[] = {Mat_<float>(tilted_mat2), Mat::zeros(mat1.size(), CV_32F)};

    Mat complexI;

    //2nd multi-threaded part   //around 150 ms
    merge(planes, 2, complexI);                     
    dft(complexI, complexI);                        
    split(complexI, planes);                        

    merge(planes2, 2, complexI);            
    dft(complexI, complexI);                        
    split(complexI, planes2);
    //2nd m-t part ends 

    // do some operations with mat1, mat2, planes etc
    clock_t s11 = clock();
    cout << "total time diff " << s11-s1 << endl;

    return result;
}

This is the method that I want to make more efficient. It takes around 600 ms per call. My idea was to make some independent parts of the method multi-threaded, and I found two places that can run in parallel.

To that end, I wrote a simple function for each part (the 1st and 2nd m-t parts) and ran them on their own threads:

t1 = boost::thread(subtract_mean, boost::cref(mat1_real), boost::ref(mat1));

void subtract_mean(const Mat& mat_ori, Mat& mat){
    Scalar mean1 = mean(mat_ori);
    subtract(mat_ori, (float)mean1[0], mat);
}

Similarly, the 2nd part creates two threads, one for each dft call (dft_thread).

The code involves a lot of computation, so when I run it, CPU usage reaches around 90%.

I was expecting a better result with the inner threads, but it is not faster.

Here are my questions: Why does my code run faster without dft_thread and sub_thread? How can I make crossCorrelate faster? Could I create the inner threads once and reuse them over and over, and would that make my code faster? Is there a clever way of inserting inner threads into my code?

EDIT: I did some new tests: with no inner threads, I checked what happens when the number of outer threads is 1-2-4-6-8 for tree size = 16. Here are the results:

numThread     1       2       4       6       8
Total time    29      35      51      77      104   (sec)
avg_time      29      17.5    12.7    12.8    13    (sec)

I think this shows I can only get about 2.5x faster with threads. I was expecting/thinking it would be 5-6x faster with 8 threads. Is this what it should be? Am I doing something wrong, or is my understanding of threads failing?

EDIT2: I did one more test:

First: run the code with 6 threads.

Second: copy the Visual Studio project 5 times and run 6 processes at the same time, each of them running with one thread (multithreading vs parallel processing).

Multithreading takes 141 mins, whereas parallel processing takes 70 mins.

Note: running one process with one thread takes 53 mins.

What could be the reason for that? Has anybody seen such an abnormal situation? I think both should run at the same speed (maybe multithreading a bit faster) since they use the same amount of resources; am I wrong?

Thanks,

Solution

Well, not really an answer, but as a comment it would not be very readable:

  1. try to avoid parameters/returns for any function that is called often

    Plain const parameters are not enough; if possible, use global variables instead, which is much faster than heap trashing. For example, in crossCorrelate(mat1,mat2); the mat1,mat2 could be global (one set per thread, of course). In the best-case scenario, parameters are passed by pointer/reference; then it is not a big deal, but it can still buy some time. In the worst case, each one is copied into a new object on every call, and when your matrices are big, that takes time. Also do not forget that the constructor/destructor is called too...

  2. avoid dynamic allocation in often-executed code

    Allocate only once if possible. Modern C/C++ runtimes have pretty good memory managers, so this will not buy much time, but even 1-5% sometimes counts.

  3. check DFT

    As mentioned before, it should be computed as a DFFT. I am not sure whether you have a fast enough DFFT implementation, but if your input data always has the same matrix size, then you can pre-compute the weights once and use them every time. That will speed up the DFFT/IDFFT significantly.

    BTW merge,dft,split could be rewritten too (to work in place and without parameters). Or you can use double-buffering techniques (swap pointers on execute).

    As you wrote, you cannot go inside the source, so try a different DFFT/IDFFT implementation.

    What about NTT/INTT? If your algorithm just uses the FFT for its properties, then sometimes an NTT is faster, but if your input data is complex, then you have no other choice.

  4. you are reading matrices (I assume from some file)

    Check the performance of that. If it is a binary file, then there is little to improve, but if it is in text form, check the reading efficiency. For example, WinAPI ini-file reading is about 1000x slower than an efficiently written ini parser in C++, especially for big files.

  5. you can try to improve performance by better thread management

    • use a thread count matched to the CPU count. If you have a 4xCPU, then 100 threads will not be faster than 4 threads, but actually slower.
    • you can change the thread/process priority/class
    • sometimes a well-placed Sleep() actually speeds things up
    • you can play with affinity masks to control which thread runs on which CPU; sometimes it helps.

PS. BTW, how big are your matrices?

[Edit1]

When you go parallel/multi-threaded, you are accessing N times more resources at once...

  • a single matrix of yours is 1K x 1K x float = 4 MB
  • after the FFT you switch to complex, so it becomes 8 MB
  • you are doing operations on 2 matrices (A-B, for example), so that is 16 MB
  • if you do not use in-place algorithms (i.e. C=A-B rather than A=A-B), then you are using 24 MB

So check the CACHE size on your computer; there is your bottleneck (at least in my opinion). Also, when you pass a matrix as an operand or return value without a reference (an object, not a pointer), you can add 8 MB for each of them. Also consider the number of recursive calls inside a 2D (I)DFFT when N = 1024!!! The amount of heap trashing is horrible.
