Why is my computer not showing a speedup when I use parallel code?


Question


So I realize this question sounds stupid (and yes I am using a dual core), but I have tried two different libraries (Grand Central Dispatch and OpenMP), and when using clock() to time the code with and without the lines that make it parallel, the speed is the same. (for the record they were both using their own form of parallel for). They report being run on different threads, but perhaps they are running on the same core? Is there any way to check? (Both libraries are for C, I'm uncomfortable at lower layers.) This is super weird. Any ideas?

Answer


Added detail for Grand Central Dispatch in response to OP comment.


While the other answers here are useful in general, the specific answer to your question is that you shouldn't be using clock() to compare the timings. clock() measures CPU time, which is summed across threads. When you split a job between cores, it uses at least as much CPU time (usually a bit more, due to threading overhead). As the documentation for clock() puts it: "If process is multi-threaded, cpu time consumed by all individual threads of process are added."


The job is just split between threads, so the overall time you have to wait is less. You should be using wall time (the time on a wall clock). OpenMP provides the routine omp_get_wtime() for this. Take the following routine as an example:

/* Compile with OpenMP enabled and optimization off, e.g.: gcc -fopenmp -O0 -lm */
#include <omp.h>
#include <time.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int i, nthreads;
    clock_t clock_timer;
    double wall_timer;
    for (nthreads = 1; nthreads <= 8; nthreads++) {
        clock_timer = clock();
        wall_timer = omp_get_wtime();
        #pragma omp parallel for private(i) num_threads(nthreads)
        for (i = 0; i < 100000000; i++) cos(i);
        printf("%d threads: time on clock() = %.3f, on wall = %.3f\n",
            nthreads,
            (double) (clock() - clock_timer) / CLOCKS_PER_SEC,
            omp_get_wtime() - wall_timer);
    }
    return 0;
}

The results are:

1 threads: time on clock() = 0.258, on wall = 0.258
2 threads: time on clock() = 0.256, on wall = 0.129
3 threads: time on clock() = 0.255, on wall = 0.086
4 threads: time on clock() = 0.257, on wall = 0.065
5 threads: time on clock() = 0.255, on wall = 0.051
6 threads: time on clock() = 0.257, on wall = 0.044
7 threads: time on clock() = 0.255, on wall = 0.037
8 threads: time on clock() = 0.256, on wall = 0.033


You can see that the clock() time doesn't change much. I get 0.254 without the pragma, so using OpenMP with one thread is a little slower than not using OpenMP at all, but the wall time decreases with each additional thread.


The improvement won't always be this good, due to, for example, parts of your calculation that aren't parallel (see Amdahl's law) or different threads fighting over the same memory.


For Grand Central Dispatch, the GCD reference states that GCD uses gettimeofday for wall time. So I created a new Cocoa app, and in applicationDidFinishLaunching I put:

/* Needs <dispatch/dispatch.h>, <sys/time.h>, and <math.h>. */
struct timeval t1, t2;
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
for (int iterations = 1; iterations <= 8; iterations++) {
    int stride = 1e8 / iterations;
    gettimeofday(&t1, 0);
    dispatch_apply(iterations, queue, ^(size_t i) {
        for (int j = 0; j < stride; j++) cos(j);
    });
    gettimeofday(&t2, 0);
    NSLog(@"%d iterations: on wall = %.3f\n", iterations,
          t2.tv_sec + t2.tv_usec/1e6 - (t1.tv_sec + t1.tv_usec/1e6));
}


and I get the following results on the console:

2010-03-10 17:33:43.022 GCDClock[39741:a0f] 1 iterations: on wall = 0.254
2010-03-10 17:33:43.151 GCDClock[39741:a0f] 2 iterations: on wall = 0.127
2010-03-10 17:33:43.236 GCDClock[39741:a0f] 3 iterations: on wall = 0.085
2010-03-10 17:33:43.301 GCDClock[39741:a0f] 4 iterations: on wall = 0.064
2010-03-10 17:33:43.352 GCDClock[39741:a0f] 5 iterations: on wall = 0.051
2010-03-10 17:33:43.395 GCDClock[39741:a0f] 6 iterations: on wall = 0.043
2010-03-10 17:33:43.433 GCDClock[39741:a0f] 7 iterations: on wall = 0.038
2010-03-10 17:33:43.468 GCDClock[39741:a0f] 8 iterations: on wall = 0.034


which is about the same as I was getting above.


This is a very contrived example. In fact, you need to be sure to keep optimization at -O0, or else the compiler will realize we don't keep any of the calculations and skip the loop entirely. Also, the integer I'm taking the cos of differs between the two examples, but that doesn't affect the results much. See STRIDE in the dispatch_apply man page for how to do this properly, and for why iterations is broadly comparable to num_threads in this case.


I note that Jacob's answer includes


I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on... This way you can be sure that it's running on both cores.


which is not correct (it has been partly fixed by an edit). Using omp_get_thread_num() is indeed a good way to ensure that your code is multithreaded, but it doesn't show "which core it's working on", just which thread. For example, the following code:

#include <omp.h>
#include <stdio.h>

int main() {
    int i;
    #pragma omp parallel for private(i) num_threads(50)
    for (i = 0; i < 50; i++) printf("%d\n", omp_get_thread_num());
}


prints out that it's using threads 0 to 49, but it doesn't show which core it's working on, since I only have eight cores. By looking at Activity Monitor (the OP mentioned GCD, so they must be on a Mac; go to Window > CPU Usage), you can see jobs switching between cores, so core != thread.
