分割工作,多个线程需要更多的时间,为什么呢? [英] Dividing work to more threads takes more time, why?

查看:129
本文介绍了分割工作,多个线程需要更多的时间,为什么呢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个计算小的C程序的 PI 的使用 monte-卡罗 -simulation这基本上只是测试了一个随机点[X,Y]如果它是内部还是一个圈子之外。

I have a small C program which calculates pi using a monte-carlo-simulation which basically just tests for a random point [x,y] if it's inside or outside a circle.

要逼近的 PI 的我必须使用大量的样本的 N 的有直接正比复杂的 O(N)的。因此,试图到n计算样本数量巨大,我实现了 POSIX线程 API来parallize的计算能力

To approximate pi I have to use a high number of samples n which has a direct proportional complexity of O(n). So trying to calculate a huge number of samples n, I implemented POSIX threads api to parallize the computational power.

我的code是这样的:

My code looks like this:

pthread_t worker[nthreads]; /* creates workers for each thread */
struct param aparam[nthreads]; /* struct param{ long* hits; long rounds; }; */
long nrounds = nsamples / nthreads; /* divide samples to subsets of equal rounds per thread */

for (int i = 0; i < nthreads; ++i) { /* loop to create threads */
    aparam[i].hits = 0;
    aparam[i].rounds = nrounds;
    pthread_create(&worker[i], NULL, calc_pi, &aparam[i]); /* calls calc_pi(void* vparam){}  */ 
}

long nhits = 0;
for (int j = 0; j < nthreads; ++j) { /* collects results */
    pthread_join(worker[j], NULL);
    nhits += (long)aparam[j].hits; /* counts hits inside the cicrle */
}

这是每个线程正在做:

And this is what each thread is doing:

void* calc_pi(void* vparam)
{ /* counts hits inside a circle */
    struct param *iparam;
    iparam = (struct param *) vparam;
    long hits = 0;
    float x, y, z;
    for (long i = 0; i < iparam->rounds; ++i) {
        x = (float)rand()/RAND_MAX;
        y = (float)rand()/RAND_MAX;
        z = x * x + y * y;
        if (z <= 1.f) /* circle radius of 1 */
            ++hits;
    }
    iparam->hits = (long*)hits;
    return NULL;
}

现在我有一个奇怪的观察。与同组样品的 N 的和越来越多的线程的 I 的这个计划需要,而不是更少的更多的时间

Now I have a strange observation. With the same set of samples n and with an increasing number of threads i this program takes more time instead of less.

下面是一些平均运行时间(可重放):

Here are some average run times (reproducable):

-------------------------------------------------
| Threads[1] | Samples[1] | Rounds[1] | Time[s] |
-------------------------------------------------
|        32  |  268435456 |   8388608 |    118  |
|        16  |  268435456 |  16777216 |    106  |
|         8  |  268435456 |  33554432 |    125  |
|         4  |  268435456 |  67108864 |    152  |
|         2  |  268435456 | 134217728 |     36  |
|         1  |  268435456 | 268435456 |     15  |
-------------------------------------------------

为什么两个线程做同样的工作比服用双倍的时间超过一个单线程更是实例?我的假设是两个线程划分工作应至少减少50%的时间。

Why is for instance two threads doing the same work taking more than double of the time than one single thread? My assumption is that two threads dividing the work should reduce the time by at least 50%.

用gcc编译4.9.1及以下标志:

Compiled with GCC 4.9.1 and the following flags:

gcc -O2 -std=gnu11 -pthread pipa.c -lpthread -o pipa

我的硬件是一个双Intel Xeon E5520(2个处理器,每个4核)@ 2.26 GHz的超线程禁用,运行Linux的科学与2.6.18内核。

My hardware is a Dual Intel Xeon E5520 (2 processors with each 4 cores) @ 2.26 GHz, hyperthreading disabled, running scientific linux with 2.6.18 kernel.

任何想法?

推荐答案

最昂贵的操作你的线程执行呼吁兰特()。在兰特()是一个天真,简单,和一般非MT可扩展功能(因为它保证了相同的种子生产的同一序列的随机数​​字)。我觉得兰特()里面的锁被序列化的所有线程。(*)

The most expensive operation your thread performs is calling rand(). The rand() is a naive, simplistic, and generally non-MT scalable function (since it guarantees for the same seed to produce the same sequence of random numbers). I think the lock inside the rand() is serializing all the threads.(*)

一个简单的技巧,以确认其是否是问题或没有,正在调试器启动程序,然后,好几次:暂停,捕捉线程的堆栈跟踪,继续。无论最常出现的踪迹,很有可能是瓶颈。

A simple trick to confirm whether it is the problem or not, is to start the program under debugger, then, several times: pause it, capture the stack trace of the threads, continue. Whatever appears most often in the stacktraces, very likely is the bottleneck.

(*)是什么使得它更慢是锁争会带来额外的性能损失的事实。此外,许多线程增加额外开销的进程调度和上下文切换。

(*) What makes it even slower is the fact that lock contention causes additional performance penalty. Also, the many threads add additional overhead of the process scheduling and the context switches.

这篇关于分割工作,多个线程需要更多的时间,为什么呢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆