Why OpenMP under Ubuntu 12.04 is slower than the serial version
I've read some other questions on this topic; however, they didn't solve my problem. I wrote the code below, and both the pthread version and the omp version came out slower than the serial version. I'm very confused.

Compiled under this environment:

```
Ubuntu 12.04 64bit 3.2.0-60-generic
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
```
```
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Vendor ID:             AuthenticAMD
CPU family:            18
Model:                 1
Stepping:              0
CPU MHz:               800.000
BogoMIPS:              3593.36
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0,1
```
Compile command:

```
g++ -std=c++11 ./eg001.cpp -fopenmp
```
```cpp
#include <cmath>
#include <cstdio>
#include <ctime>
#include <omp.h>
#include <pthread.h>

#define NUM_THREADS 5
const int sizen = 256000000;

struct Data {
    double *pSinTable;
    long tid;
};

void *compute(void *p)
{
    Data *pDt = (Data *)p;
    const int start = sizen * pDt->tid / NUM_THREADS;
    const int end   = sizen * (pDt->tid + 1) / NUM_THREADS;
    for (int n = start; n < end; ++n) {
        pDt->pSinTable[n] = std::sin(2 * M_PI * n / sizen);
    }
    pthread_exit(nullptr);
}

int main()
{
    double *sinTable = new double[sizen];
    pthread_t threads[NUM_THREADS];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    clock_t start, finish;
    start = clock();
    int rc;
    Data dt[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i) {
        dt[i].pSinTable = sinTable;
        dt[i].tid = i;
        rc = pthread_create(&threads[i], &attr, compute, &dt[i]);
    }
    pthread_attr_destroy(&attr);
    for (int i = 0; i < NUM_THREADS; ++i) {
        rc = pthread_join(threads[i], nullptr);
    }
    finish = clock();
    printf("from pthread: %lf\n", (double)(finish - start) / CLOCKS_PER_SEC);
    delete sinTable;   // should be delete[] -- see the updates below

    sinTable = new double[sizen];
    start = clock();
#pragma omp parallel for
    for (int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from omp: %lf\n", (double)(finish - start) / CLOCKS_PER_SEC);
    delete sinTable;

    sinTable = new double[sizen];
    start = clock();
    for (int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from serial: %lf\n", (double)(finish - start) / CLOCKS_PER_SEC);
    delete sinTable;

    pthread_exit(nullptr);
    return 0;
}
```
Output:

```
from pthread: 21.150000
from omp: 20.940000
from serial: 20.800000
```
I wondered whether it was a problem with my code, so I used pthreads to do the same thing. However, the result was the same, so I wonder whether it might be a problem with OpenMP/pthreads on Ubuntu.

I have a friend who also has an AMD CPU and Ubuntu 12.04 and sees the same problem there, so I have some reason to believe the problem is not limited to my machine. If anyone has the same problem, or has a clue about it, thanks in advance.

In case the code is not convincing enough, I ran a benchmark and pasted the result here: http://pastebin.com/RquLPREc

The benchmark URL: http://www.cs.kent.edu/~farrell/mc08/lectures/progs/openmp/microBenchmarks/src/download.html
New information:

I ran the code (without the pthread version) on Windows with VS2012. I used 1/10 of sizen because Windows does not let me allocate such a large chunk of memory. The results:

```
from omp: 1.004
from serial: 1.420
from FreeNickName: 735 (this one is the improvement suggested by @FreeNickName)
```
Does this indicate that it could be a problem with the Ubuntu OS?

Update: the problem is solved by using the omp_get_wtime function, which is portable across operating systems. See the answer by Hristo Iliev.

Below are some tests on the controversial topic suggested by FreeNickName. (Sorry, I had to test on Ubuntu, because the Windows machine belongs to a friend of mine.)
--1-- Changed from delete to delete [] (but without memset) (-std=c++11 -fopenmp):

```
from pthread: 13.491405
from omp: 13.023099
from serial: 20.665132
from FreeNickName: 12.022501
```
--2-- With memset immediately after new (-std=c++11 -fopenmp):

```
from pthread: 13.996505
from omp: 13.192444
from serial: 19.882127
from FreeNickName: 12.541723
```
--3-- With memset immediately after new (-std=c++11 -fopenmp -march=native -O2):

```
from pthread: 11.886978
from omp: 11.351801
from serial: 17.002865
from FreeNickName: 11.198779
```
--4-- With memset immediately after new, and with FreeNickName's version placed before the OMP version (-std=c++11 -fopenmp -march=native -O2):

```
from pthread: 11.831127
from FreeNickName: 11.571595
from omp: 11.932814
from serial: 16.976979
```
--5-- With memset immediately after new, with FreeNickName's version placed before the OMP version, and with NUM_THREADS set to 5 instead of 2 (my CPU is dual-core):

```
from pthread: 9.451775
from FreeNickName: 9.385366
from omp: 11.854656
from serial: 16.960101
```
Solution

There is nothing wrong with OpenMP in your case. What is wrong is the way you measure the elapsed time.
Using clock() to measure the performance of multithreaded applications on Linux (and most other Unix-like OSes) is a mistake, since it does not return the wall-clock (real) time but the accumulated CPU time of all process threads (and on some Unix flavours even the accumulated CPU time of all child processes). Your parallel code shows better performance on Windows because there clock() returns the real time rather than accumulated CPU time.

The best way to prevent such discrepancies is to use the portable OpenMP timer routine omp_get_wtime():

```cpp
double start = omp_get_wtime();
#pragma omp parallel for
for (int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
double finish = omp_get_wtime();
printf("from omp: %lf\n", finish - start);
```
For non-OpenMP applications, you should use clock_gettime() with the CLOCK_REALTIME clock:

```cpp
struct timespec start, finish;
clock_gettime(CLOCK_REALTIME, &start);
#pragma omp parallel for
for (int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
clock_gettime(CLOCK_REALTIME, &finish);
printf("from omp: %lf\n",
       (finish.tv_sec + 1.e-9 * finish.tv_nsec) -
       (start.tv_sec + 1.e-9 * start.tv_nsec));
```