Why is OpenMP under Ubuntu 12.04 slower than the serial version?


Problem description

I've read other questions on this topic, but they didn't solve my problem.

I wrote the code below, and both the pthread version and the OpenMP version turn out slower than the serial version. I'm very confused.

Compiled in this environment:

Ubuntu 12.04 64bit 3.2.0-60-generic
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Vendor ID:             AuthenticAMD
CPU family:            18
Model:                 1
Stepping:              0
CPU MHz:               800.000
BogoMIPS:              3593.36
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0,1

Compile command:

g++ -std=c++11 ./eg001.cpp -fopenmp

#include <cmath>
#include <cstdio>
#include <ctime>
#include <omp.h>
#include <pthread.h>

#define NUM_THREADS 5
const int sizen = 256000000;

struct Data {
    double * pSinTable;
    long tid;
};

void * compute(void * p) {
    Data * pDt = (Data *)p;
    const int start = sizen * pDt->tid/NUM_THREADS;
    const int end = sizen * (pDt->tid + 1)/NUM_THREADS;
    for(int n = start; n < end; ++n) {
        pDt->pSinTable[n] = std::sin(2 * M_PI * n / sizen);
    }
    pthread_exit(nullptr);
}

int main()
{
    double * sinTable = new double[sizen];
    pthread_t threads[NUM_THREADS];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    clock_t start, finish;

    start = clock();
    int rc;
    Data dt[NUM_THREADS];
    for(int i = 0; i < NUM_THREADS; ++i) {
        dt[i].pSinTable = sinTable;
        dt[i].tid = i;
        rc = pthread_create(&threads[i], &attr, compute, &dt[i]);
    }//for
    pthread_attr_destroy(&attr);
    for(int i = 0; i < NUM_THREADS; ++i) {
        rc = pthread_join(threads[i], nullptr);
    }//for
    finish = clock();
    printf("from pthread: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];

    start = clock();
#   pragma omp parallel for
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from omp: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;
    sinTable = new double[sizen];

    start = clock();
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from serial: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);

    delete sinTable;

    pthread_exit(nullptr);
    return 0;
}

Output:

from pthread: 21.150000
from omp: 20.940000
from serial: 20.800000

I wondered whether the problem was in my OpenMP code, so I used pthread to do the same thing.

However, I was wrong: the pthread version is slow as well, so I wonder whether it might be a problem with OpenMP/pthread on Ubuntu.

I have a friend who also has an AMD CPU and Ubuntu 12.04 and who saw the same problem, so I have some reason to believe that the problem is not limited to me.

If anyone has the same problem, or has a clue about its cause, thanks in advance.


In case the code above is not a good enough test, I also ran a benchmark and pasted the result here:

http://pastebin.com/RquLPREc

The benchmark url: http://www.cs.kent.edu/~farrell/mc08/lectures/progs/openmp/microBenchmarks/src/download.html


New information:

I ran the code on Windows (without the pthread version) with VS2012.

I used 1/10 of sizen because Windows does not allow me to allocate such a large chunk of memory. The results are:

from omp: 1.004
from serial: 1.420
from FreeNickName: 735 (this one is the improvement suggested by @FreeNickName)

Does this indicate that it could be a problem with Ubuntu?



The problem is solved by using the omp_get_wtime function, which is portable across operating systems. See the answer by Hristo Iliev.


Some tests on the controversial topic raised by FreeNickName.

(Sorry, I had to run them on Ubuntu because the Windows machine belongs to one of my friends.)

--1-- Changed from delete to delete[], but without memset: (-std=c++11 -fopenmp)

from pthread: 13.491405
from omp: 13.023099
from serial: 20.665132
from FreeNickName: 12.022501

--2-- With memset immediately after new: (-std=c++11 -fopenmp)

from pthread: 13.996505
from omp: 13.192444
from serial: 19.882127
from FreeNickName: 12.541723

--3-- With memset immediately after new: (-std=c++11 -fopenmp -march=native -O2)

from pthread: 11.886978
from omp: 11.351801
from serial: 17.002865
from FreeNickName: 11.198779

--4-- With memset immediately after new, and FreeNickName's version placed before the OMP for version: (-std=c++11 -fopenmp -march=native -O2)

from pthread: 11.831127
from FreeNickName: 11.571595
from omp: 11.932814
from serial: 16.976979

--5-- With memset immediately after new, FreeNickName's version placed before the OMP for version, and NUM_THREADS set to 5 instead of 2 (I have a dual-core CPU):

from pthread: 9.451775
from FreeNickName: 9.385366
from omp: 11.854656
from serial: 16.960101

Solution

There is nothing wrong with OpenMP in your case. What is wrong is the way you measure the elapsed time.

Using clock() to measure the performance of multithreaded applications on Linux (and most other Unix-like OSes) is a mistake since it does not return the wall-clock (real) time but instead the accumulated CPU time for all process threads (and on some Unix flavours even the accumulated CPU time for all child processes). Your parallel code shows better performance on Windows since there clock() returns the real time and not the accumulated CPU time.
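To see the difference between the two kinds of time on your own machine, here is a minimal sketch that takes both readings around the same piece of work. It assumes Linux or another POSIX system where std::clock() reports CPU time. The names Timing, measure and sleep_ms are hypothetical helpers made up for this illustration, and std::chrono::steady_clock stands in as the wall-clock source.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Hypothetical helper type: both readings for the same piece of work.
struct Timing {
    double cpu_seconds;   // std::clock(): CPU time charged to the whole process
    double wall_seconds;  // std::chrono::steady_clock: elapsed real time
};

// Run `work` once and report CPU time and wall-clock time side by side.
template <typename Work>
Timing measure(Work work) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();
    work();
    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    Timing t;
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    t.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    return t;
}

// Example workload that uses almost no CPU: it only sleeps.
inline void sleep_ms(int ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}
```

Calling measure([] { sleep_ms(200); }) on Linux should report a wall time of about 0.2 s and a CPU time close to zero. A loop that keeps two threads busy would be expected to skew the other way, with the CPU time approaching twice the wall time, which is consistent with the parallel versions in the question looking no faster than the serial one.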

The best way to prevent such discrepancies is to use the portable OpenMP timer routine omp_get_wtime():

double start = omp_get_wtime();
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
double finish = omp_get_wtime();
printf("from omp: %lf\n", finish - start);

For non-OpenMP applications, you should use clock_gettime() with the CLOCK_REALTIME clock:

struct timespec start, finish;
clock_gettime(CLOCK_REALTIME, &start);
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
clock_gettime(CLOCK_REALTIME, &finish);
printf("from omp: %lf\n", (finish.tv_sec + 1.e-9 * finish.tv_nsec) -
                          (start.tv_sec + 1.e-9 * start.tv_nsec));
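If you time several sections this way, the timespec arithmetic can be wrapped in a small helper. The following is only a sketch: elapsed_seconds and wall_time_of are hypothetical names, and it uses CLOCK_MONOTONIC rather than CLOCK_REALTIME on the assumption that you only need intervals (the monotonic clock is not affected by adjustments to the system time). On older glibc versions (before 2.17) clock_gettime() also requires linking with -lrt.

```cpp
#include <time.h>

// Hypothetical helper: (end - start) in seconds. Seconds and nanoseconds
// are subtracted separately so that no precision is lost to large tv_sec values.
inline double elapsed_seconds(const timespec& start, const timespec& end) {
    return static_cast<double>(end.tv_sec - start.tv_sec) +
           1e-9 * static_cast<double>(end.tv_nsec - start.tv_nsec);
}

// Hypothetical helper: wall-clock time of one call to `work`.
// CLOCK_MONOTONIC is an assumption of this sketch; the answer uses CLOCK_REALTIME.
template <typename Work>
double wall_time_of(Work work) {
    timespec start, finish;
    clock_gettime(CLOCK_MONOTONIC, &start);
    work();
    clock_gettime(CLOCK_MONOTONIC, &finish);
    return elapsed_seconds(start, finish);
}
```

With this in place, the pthread section from the question could be timed as wall_time_of([&] { /* create and join the threads */ }) and printed the same way as before.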
