OpenMP并行for循环,性能几乎没有提高 [英] OpenMP Parallel for-loop showing little performance increase

查看:166
本文介绍了OpenMP并行for循环,性能几乎没有提高的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在学习如何在C语言中使用OpenMP,并且作为HelloWorld练习,我正在编写一个计算素数的程序.然后,我将其并行化如下:

I am in the process of learning how to use OpenMP in C, and as a HelloWorld exercise I am writing a program to count primes. I then parallelise this as follows:

int numprimes = 0;
#pragma omp parallel for reduction (+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i) == true)
        numprimes ++;
}

我使用 gcc -g -Wall -fopenmp -o primes primes.c -lm ( -lm math.h 我正在使用的函数).然后,我在Intel®Core™2 Duo CPU E8400 @ 3.00GHz×2 上运行此代码,并且正如预期的那样,性能要优于串行程序.

I compile this code using gcc -g -Wall -fopenmp -o primes primes.c -lm (-lm for the math.h functions I am using). Then I run this code on an Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2, and as expected, the performance is better than for a serial program.

但是,当我尝试在功能更强大的计算机上运行此问题时,就会出现问题.(我也尝试过手动设置要与 num_threads 一起使用的线程数,但这并没有改变任何东西.)将所有素数计算为 10 000 000 会给我以下时间(使用 time ):

The problem, however, comes when I try to run this on a much more powerful machine. (I have also tried to manually set the number of threads to use with num_threads, but this did not change anything.) Counting all the primes up to 10 000 000 gives me the following times (using time):

8核机器:

real    0m8.230s
user    0m50.425s
sys     0m0.004s

双核计算机:

real    0m10.846s
user    0m17.233s
sys     0m0.004s

这种模式继续计数更多的质数,具有更多内核的计算机显示出轻微的性能提升,但不如我期望的那样拥有更多的可用内核那么多.(我希望内核增加4倍,意味着运行时间减少近4倍?)

And this pattern continues for counting more primes, the machine with more cores shows a slight performance increase, but not as much as I would expect for having so many more cores available. (I would expect 4 times more cores to imply almost 4 times less running time?)

计数素数最多为 500000000 :

8核机器:

real    1m29.056s
user    8m11.695s
sys     0m0.017s

双核计算机:

real    1m51.119s
user    2m50.519s
sys     0m0.060s

如果有人可以为我澄清这一点,将不胜感激.

If anyone can clarify this for me, it would be much appreciated.

编辑

这是我的主要检查功能.

This is my prime-checking function.

static int is_prime(int n)
{
  /* handle special cases */
  if      (n == 0) return 0;
  else if (n == 1) return 0;
  else if (n == 2) return 1;

  int i;
  for(i=2;i<=(int)(sqrt((double) n));i++)
    if (n%i==0) return 0;

  return 1;
}

推荐答案

之所以发生这种性能,是因为:

This performance is happening because:

  1. is_prime(i)花费的时间越长, i 所获得的时间就越长,并且
  2. 您的OpenMP实现默认情况下对 parallel for 构造使用静态调度,而没有 schedule 子句,即将for循环分割为相等大小的连续块.
  1. is_prime(i) takes longer the higher i gets, and
  2. Your OpenMP implementation uses static scheduling by default for parallel for constructs without the schedule clause, i.e. it chops the for loop into equal sized contiguous chunks.

换句话说,编号最高的线程正在执行所有最困难的操作.

In other words, the highest-numbered thread is doing all of the hardest operations.

使用 schedule 子句显式选择一种更合适的调度类型,可以使您在线程之间公平地分配工作.

Explicitly selecting a more appropriate scheduling type with the schedule clause allows you to divide work among the threads fairly.

此版本将更好地划分工作:

This version will divide the work better:

int numprimes = 0;
#pragma omp parallel for schedule(dynamic, 1) reduction(+:numprimes) 
for (i = 1; i <= n; i++)
{
    if (is_prime(i) == true)
        numprimes ++;
}

有关调度语法的信息,可以通过 MSDN获得维基百科.

Information on scheduling syntax is available via MSDN and Wikipedia.

进度表(动态,1)可能不是最佳的,因为高效能标记在他的答案中注明.在 OpenMP中,对调度粒度进行了更深入的讨论.

schedule(dynamic, 1) may not be optimal, as High Performance Mark notes in his answer. There is a more in-depth discussion of scheduling granularity in this OpenMP wihtepaper.

也感谢 Jens Gustedt

Thanks also to Jens Gustedt and Mahmoud Fayez for contributing to this answer.

这篇关于OpenMP并行for循环,性能几乎没有提高的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆