Simple division of labour over threads is not reducing the time taken


Problem description


I have been trying to improve computation times on a project by splitting the work into tasks/threads, and it has not been working out very well. So I decided to make a simple test project to see if I could get it working in a very simple case, and that is not working as I expected either.

What I have attempted to do is:

  • do a task X times in one thread - check the time taken.
  • do a task X / Y times in Y threads - check the time taken.

So if 1 thread takes T seconds to do 100'000'000 iterations of "work" then I would expect:

  • 2 threads doing 50'000'000 iterations each would take ~ T / 2 seconds
  • 3 threads doing 33'333'333 iterations each would take ~ T / 3 seconds

and so on until I reach some threading limit (number of cores or whatever).

So I wrote the code and tested it on my 8-core system (AMD Ryzen) with plenty of RAM (>16GB) and nothing else running at the time.

  • 1 thread took: ~6.5s
  • 2 threads took: ~6.7s
  • 3 threads took: ~13.3s
  • 8 threads took: ~16.2s

So clearly something is not right here!

I ported the code to Godbolt and I see similar results. Godbolt only allows 3 threads, and for 1, 2 or 3 threads it takes ~8s (this varies by about 1s) to run. Here is the Godbolt live code: https://godbolt.org/z/6eWKWr

Finally here is the code for reference:

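#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

// Random double in [0, 1], built on rand()
#define randf() (((double) rand()) / ((double) (RAND_MAX)))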

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;
    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();
    // do some work for the required number of iterations
    for (auto i = 0u; i < iterations; i++)
    {
        double value = randf();
        double calc = std::atan(value);
        (void) calc;
    }
    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;
    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << std::endl;
}

int main()
{
    // Note: these numbers vary by about 1s, probably due to Godbolt server load (?)
    // 1 Threads takes: ~8s
    // 2 Threads takes: ~8s
    // 3 Threads takes: ~8s
    uint32_t num_threads = 3; // Max 3 in godbolt
    uint32_t total_work = 100'000'000;

    // Seed rand
    std::srand(static_cast<unsigned int>(std::chrono::steady_clock::now().time_since_epoch().count()));

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        // emplace_back constructs the std::thread in place from the lambda
        task_list.emplace_back([=]() { thread_func(total_work / num_threads, thread_id); });
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;
    return 0;
}

Note: I have tried using std::async also with no difference (not that I was expecting any). I also tried compiling for release - no difference.

I have read questions such as why-using-more-threads-makes-it-slower-than-using-less-threads, and I can't see an obvious (to me) bottleneck:

  • CPU bound (needs lots of CPU resources): I have 8 cores
  • Memory bound (needs lots of RAM resources): I have assigned my VM 10GB ram, running nothing else
  • I/O bound (Network and/or hard drive resources): No network traffic involved
  • There is no sleeping/mutexing going on here (like there is in my real project)

Questions are:

  • Why might this be happening?
  • What am I doing wrong?
  • How can I improve this?

Solution

The rand function is not guaranteed to be thread safe. It appears that your implementation makes it thread safe by using a lock or mutex, so when multiple threads try to generate random numbers they have to take turns. As your loop is mostly just the call to rand, performance suffers with multiple threads.
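A rough way to see why no speedup appears is Amdahl's law, under a simplified model that is an assumption made here for illustration and not part of the original answer: let s be the fraction of each iteration spent inside the serialized rand call, T_1 the single-thread time, and T_Y the time with Y threads.

```latex
T_Y \approx T_1 \left( s + \frac{1 - s}{Y} \right),
\qquad
s \to 1 \;\Rightarrow\; T_Y \approx T_1
```

With s close to 1, adding threads barely changes the run time, which is roughly what the 1-thread and 2-thread measurements show. This model ignores the cost of contending for the lock itself, which may be why the 3-thread and 8-thread runs are slower rather than merely no faster.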

You can use the facilities of the <random> header and have each thread use its own engine to generate the random numbers.
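A minimal sketch of that suggestion follows. It assumes std::mt19937 as the engine and seeds each thread from std::random_device; the names thread_work, run_threads and RunResult are hypothetical helpers introduced for illustration and do not come from the question. Each thread sums its results so the compiler cannot discard the loop.

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Hypothetical helper: runs `iterations` units of work with an engine owned
// by the calling thread, so no random-number state is shared between threads.
double thread_work(std::uint32_t iterations, std::uint32_t seed)
{
    std::mt19937 engine(seed);                               // per-thread engine
    std::uniform_real_distribution<double> dist(0.0, 1.0);   // values in [0, 1)
    double sum = 0.0;
    for (std::uint32_t i = 0; i < iterations; ++i)
    {
        sum += std::atan(dist(engine));
    }
    return sum;  // returned so the work has an observable result
}

// Hypothetical result type: a checksum of the work and the elapsed wall time.
struct RunResult
{
    double checksum;
    long long elapsed_ms;
};

// Hypothetical helper: splits total_work evenly over num_threads threads.
RunResult run_threads(std::uint32_t total_work, std::uint32_t num_threads)
{
    std::random_device rd;
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> task_list;

    auto start = std::chrono::steady_clock::now();
    for (std::uint32_t t = 0; t < num_threads; ++t)
    {
        const std::uint32_t seed = rd();  // a different seed for every thread
        task_list.emplace_back([&partial, t, seed, total_work, num_threads]() {
            // Each thread writes only to its own slot of `partial`.
            partial[t] = thread_work(total_work / num_threads, seed);
        });
    }
    for (auto &task : task_list)
    {
        task.join();
    }
    auto elapsed = std::chrono::steady_clock::now() - start;

    double checksum = 0.0;
    for (double p : partial)
    {
        checksum += p;
    }
    return {checksum,
            std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()};
}
```

Calling run_threads(100'000'000, 1) and then run_threads(100'000'000, 8) from main and printing elapsed_ms should make the comparison from the question repeatable. Whether the time scales close to T / Y still depends on the machine, so treat this as something to measure rather than a guaranteed result.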
