多线程基准 [英] Multi-threading benchmark

查看：72 发布时间：2020/5/14 0:47:20 c++ multithreading performance threadpool

本文介绍了多线程基准的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我进行了大量的数学计算，以计算双素数中的数量范围，并且我已将任务划分为多个线程.

I have had a heavy mathematical computation to count the number of twin prime numbers within a range and I have divided the task between threads.

在这里，您可以看到执行时间与线程数之间的关系.

Here you see the profile of the execution time against the number of threads.

我的问题是关于以下方面的合理性

My questions are about justification of:

为什么单线程和双线程的性能非常相似?

Why do single thread and dual threads have very similar performance?

为什么使用5或7线程时执行时间会减少，而使用6或8线程时执行时间会增加? (我在几次测试中都经历过.)

Why there is a drop in execution time when it is 5- or 7-threaded, while execution time increases when 6 or 8 threads are used? (I have experienced that in several tests.)

我使用了8核计算机.我可以断言2× n (其中 n 是内核数)根据经验是很好的线程数吗?

I have used an 8-core computer. Can I claim that 2×n (where n is the number of cores) is a good number of threads as a rule of thumb?

如果我使用的RAM占用率很高的代码，我会期望配置文件中出现类似的趋势，还是随着线程数量的增加而急剧变化?

If I use a code with a high usage of RAM, would I expect similar trends in the profile, or would it dramatically change with an increasing number of threads?

这是代码的主要部分，只是表明它不占用太多RAM.

This is the main part of the code only to show it does not use much RAM.

bool is_prime(long a)
{
    if(a<2l)
        return false;
    if(a==2l)
        return true;
    for(long i=2;i*i<=a;i++)
        if(a%i==0)
            return false;
    return true;
}

uint twin_range(long l1,long l2,int processDiv)
{
    uint count=0;
    for(long l=l1;l<=l2;l+=long(processDiv))
        if(is_prime(l) && is_prime(l+2))
        {
            count++;
        }
    return count;
}

规格:

$ lsb_release -a

Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:    16.04
Codename:   xenial

$ lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Stepping:              3
CPU MHz:               799.929
CPU max MHz:           4000.0000
CPU min MHz:           800.0000
BogoMIPS:              6815.87
Virtualisation:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp

更新(接受答案后)

新个人资料:

改进的代码如下.现在，工作负载已公平分配.

The improved code is as follows. Now, the workload is distributed fairly.

bool is_prime(long a)
{
    if(a<2l)
        return false;
    if(a==2l)
        return true;
    for(long i=2;i*i<=a;i++)
        if(a%i==0)
            return false;
    return true;
}


void twin_range(long n_start,long n_stop,int index,int processDiv)
{
    // l1+(0,1,...,999)+0*1000
    // l1+(0,1,...,999)+1*1000
    // l1+(0,1,...,999)+2*1000
    // ...

    count=0;
    const long chunks=1000;
    long r_begin=0,k=0;
    for(long i=0;r_begin<=n_stop;i++)
    {
        r_begin=n_start+(i*processDiv+index)*chunks;
        for(k=r_begin;(k<r_begin+chunks) && (k<=n_stop);k++)
        {
            if(is_prime(k) && is_prime(k+2))
            {
                count++;
            }
        }
    }

    std::cout
        <<"Thread "<<index<<" finished."
        <<std::endl<<std::flush;
    return count;
}

推荐答案

请考虑当 last 线程完成检查其数字范围时，程序将完成.也许某些线程比其他线程更快?

Consider that your program will finish when the last thread has finished checking its range of numbers. Perhaps some threads are faster than others?

is_prime()需要多长时间才能确定偶数为质数?它将在第一次迭代中找到它.查找奇数的素数将需要至少两次迭代，如果a是素数，则可能需要多达sqrt(a)次迭代. is_prime()当给定的质数比偶数大时会慢得多！

How long does is_prime() take to determine that an even number is prime? It will find this on the first iteration. Finding the primality of an odd number will take at least two iterations and possibly up to sqrt(a) iterations if a is prime. is_prime() will be very much slower when it is given a large prime than an even number!

在两个线程的情况下，一个线程将检查100000000、100000002、100000004等的素数，而另一个线程将检查100000001、100000003、100000005等的素数.一个线程检查所有偶数，而另一个检查所有偶数.奇数(包括所有那些慢质数！).

In your two thread case, one thread will check the primality of 100000000, 100000002, 100000004, etc. while the other thread will check 100000001, 100000003, 100000005, etc. One thread checks all the even numbers while the other checks all the odd numbers (including all those slow primes!).

让线程在完成时打印出("Thread at %ld done", l1)，并且我认为您会发现某些线程比其他线程快得多，这是由于您在线程之间划分域的方式所致.偶数个线程会将所有偶数值赋予同一线程，从而导致分区特别差，这就是为什么偶数线程数比奇数慢的原因.

Have your threads print out ("Thread at %ld done", l1) when they finish, and I think you will find that some threads are very much faster than others, due to the way you are dividing the domain between the threads. An even number of threads will give all even values to the same thread(s), resulting in a particularly poor partitioning, which is why your even thread numbers are slower than the odd.

这将是一部很棒的XKCD风格的喜剧. 我们需要检查所有这些数字以找到素数！手工！" 好吧，我来检查偶数，你赔率."

This would make a great XKCD-esque comic. "We need to check all these numbers to find primes! By hand!" "Ok, I'll check the evens, you do the odds."

您真正的问题是，像您所做的那样，固定域分解要求每个分区花费相同的时间才能达到最佳状态.

Your real problem here is that a fixed domain decomposition like you have done requires that each partition take the same amount of time to be optimal.

解决此问题的方法是动态地进行分区.常用的模式涉及一个工作线程池，这些工作线程以块为单位请求工作.如果该块与线程要完成的总工作量相比很小，那么所有线程将在相似的时间内完成其工作.

The way to solve this is to dynamically do the partitioning. A pattern commonly used involves a pool of worker threads that request work in chunks. If the chunk is small compared to the total work a thread will do, then all threads will finish their work in a similar amount of time.

对于您的问题，您可能具有互斥保护的全局数据集start_number, stop_number, total_twins.每个线程将保存start_number，然后再将其全局值增加chunk_size.然后，它搜索范围[saved_start_number, saved_start_number+chunk_size)，完成后将找到的双胞胎的数量添加到全局total_twins中.辅助线程将继续执行此操作，直到start_number >= stop_number.对全局变量的访问使用互斥量进行保护.一个人必须调整块的大小，以从获取块的成本和互斥锁上的争用与空闲的工作线程的低效率(这些空闲的工作线程没有更多块可分配，而另一个线程仍在处理最后一个块)的效率上来限制效率低下.如果您使用原子增量来请求块，则块大小可能会与单个值一样小，但是如果在分布式计算系统中需要网络请求，则块大小将需要大得多.这是其工作原理的概述.

For your problem, you could have a mutex protected global data set start_number, stop_number, total_twins. Each thread will save start_number before incrementing its global value by chunk_size. Then it searches the range [saved_start_number, saved_start_number+chunk_size), adding the number of twins found to the global total_twins when done. The worker threads keep doing this until start_number >= stop_number. Access to the globals use the mutex for protection. One has to adjust the chunk size to limit inefficiency from the cost of getting a chunk and contention on the mutex vs the inefficiency of idle worker threads having no more chunks to allocate while another thread is still working on it's last chunk. If you used an atomic increment to request a chunk, maybe the chunk size could be as small as a single value, but if one needed a network request in a distributed computing system then the chunk size would need to be much larger. That's the overview of how it works.

顺便说一句，您的is_prime()测试非常幼稚并且非常慢.如果数字不能被2整除，可以将其除以4吗?一个人可以做得更好！

BTW, your is_prime() test is naive and exceedingly slow. If a number was not divisible by 2, can it be divided by 4? One can do much better!

这篇关于多线程基准的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

多线程基准 [英] Multi-threading benchmark

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

多线程基准 [英] Multi-threading benchmark

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭