Fortran LAPACK: high CPU %sys usage with DSYEV - no parallelization - normal?

    See further update below.

    I am observing a quite high system CPU usage when running my Fortran code. The user CPU usage takes about one core (the system is an Intel i7 with 4 cores/8 threads, running Linux) whilst the system CPU usage eats up about 2 cores (hence an overall CPU usage of about 75%). Can anyone explain to me where this is coming from and whether this is "normal" behaviour?

    I compile the code with gfortran (optimization turned off -O0, though that part doesn't seem to matter) and link against BLAS, LAPACK and some (other) C-functions. My own code is not using any parallelization and neither does the linked code (as far as I can tell). At least I am not using any parallelized library versions.

    The code itself is about assembling and solving finite element systems and uses a lot (?) of allocation and intrinsic function calls (matmul, dot_product), though the overall RAM usage is pretty low (~200MB). I don't know if this information is sufficient/useful, but I hope someone knows what is going on there.

    Best regards, Ben

    UPDATE I think I did track down (part of) the problem to a call to DSYEV from LAPACK (computes eigenvalues of a real symm. matrix A, in my case 3x3).

    program test
    
    implicit none
    
    integer,parameter :: ndim=3
    real(8) :: tens(ndim,ndim)
    
    integer :: mm,nn
    real(8), dimension(ndim,ndim):: eigvec
    real(8), dimension(ndim)   :: eigval
    
    character, parameter    :: jobz='v'  ! Flags calculation of eigenvectors
    character, parameter    :: uplo='u'  ! Flags upper triangular 
    integer, parameter      :: lwork=102   ! Length of work array
    real(8), dimension(lwork)  :: work      ! Work array
    integer :: info   
    
    tens(1,:) = [1.d0, 2.d0, 3.d0]
    tens(2,:) = [2.d0, 5.d0, 1.d0]
    tens(3,:) = [3.d0, 1.d0, 1.d0]   
    
    do mm=1,5000000    
        eigvec=tens
       ! Call DSYEV
       call dsyev(jobz,uplo,ndim,eigvec,ndim,eigval,work,lwork,info)
    enddo
    
    write(*,*) eigvec
    write(*,*) int(work(1))
    
    endprogram test
    

    The compiling and linking is done with

    gfortran test.f90 -o test -llapack
    

    This program is giving me very high %sys CPU usage. Can anyone verify this (obviously LAPACK is necessary to run the code)? Is this "normal" behaviour or is something wrong with my code/system/libraries...?

    UPDATE 2 Encouraged by @roygvib's comment, I ran the code on another system. On the second system, the high CPU %sys usage could not be reproduced. Comparing the two systems, I can't seem to find where this is coming from. Both run the same OS version (Linux Ubuntu), the same gfortran version (4.8), and the same kernel version, LAPACK and BLAS. "Major" difference: the processor is an i7-4770 on the buggy system and an i7-870 on the other. Running the test code on the buggy one gives me about %user 16s and %sys 28s; on the i7-870 it is %user 16s and %sys 0s. Running the code four times in parallel gives an overall timing for each process of about 18s on the other system and 44s on the buggy system. Any ideas what else I could look for?
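
    One thing that might be worth checking is whether the process silently spawns extra threads while it runs (which would fit the high %sys pattern), for example by listing its threads:

    ./test &
    ps -L -p $!    # one line per thread (LWP) of the running test process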

    UPDATE 3 I think we are getting closer: building the test program on the other system with static links to the LAPACK and BLAS libraries,

    gfortran test.f90 -O0 /usr/lib/liblapack.a /usr/lib/libblas.a -Wl,--allow-multiple-definition
    

    and running that code on the buggy system gives me a %sys time of about 0 (as desired). On the other hand, building the test program with static links to LAPACK and BLAS on the buggy system and running the code on the other system returns high %sys CPU usage as well! So obviously the libraries seem to differ, right? Building the static version on the buggy system results in a file size of about 18MB(!), on the other system about 100KB. Additionally, I have to include the

    -Wl,--allow-multiple-definition
    

    option only on the other system (otherwise the linker complains about multiple definitions of xerbla), whilst on the buggy system I have to (explicitly) link against libpthread:

    gfortran test.f90 -O0 /usr/lib/liblapack.a /usr/lib/libblas.a -lpthread -o test
    

    The interesting thing is that

    apt-cache policy liblapack*
    

    returns the same versions and repo destinations for both systems (same goes for libblas*). Any further ideas? Maybe there is some other command to check library version that I don't know of?
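
    A few commands that can help identify which LAPACK/BLAS implementation is actually being picked up (assuming a Debian/Ubuntu system; the exact paths and alternative names below are illustrative and may differ):

    # which package owns the library files
    dpkg -S /usr/lib/liblapack.a /usr/lib/libblas.a

    # whether the alternatives system redirects liblapack/libblas to another
    # implementation (e.g. ATLAS or OpenBLAS) despite identical package versions
    update-alternatives --display liblapack.so.3
    update-alternatives --display libblas.so.3

    # compare the static archives between the two machines
    md5sum /usr/lib/liblapack.a /usr/lib/libblas.a

    # look for threading symbols inside the static archive
    nm /usr/lib/liblapack.a | grep -i -E 'pthread|omp' | head

    The 18MB static archive and the required -lpthread on the buggy system would be consistent with an optimized, threaded implementation (such as ATLAS or OpenBLAS) being installed there, but the checks above would be needed to confirm that.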

    Solution

    My interpretation of the slowdown:

    A threaded (probably OpenMP) version of LAPACK and BLAS was used. These libraries try to launch several threads to solve the linear algebra problem in parallel, which usually speeds up the computation.
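
    For the dynamically linked build, one way to check this hypothesis is to look at which shared libraries the binary actually resolves to at run time, for example:

    ldd ./test | grep -i -E 'lapack|blas|gomp|pthread'

    An entry pointing into an ATLAS/OpenBLAS directory, or a libgomp/libpthread dependency pulled in via liblapack, would indicate a threaded implementation.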

    However, in this case

    do mm=1,5000000    
       eigvec=tens
       call dsyev(jobz,uplo,ndim,eigvec,ndim,eigval,work,lwork,info)
    enddo
    

    calls the library an enormous number of times for a very small problem (a 3x3 matrix). A problem this small cannot be solved efficiently in parallel; the overhead of synchronizing the threads dominates the solution time. That synchronization (if not outright thread creation) happens 5,000,000 times!
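
    Putting rough numbers on this with the figures from UPDATE 2: about 28 s of %sys time spread over 5,000,000 calls is on the order of 5-6 microseconds of kernel-side overhead per DSYEV call, while about 16 s of %user time corresponds to roughly 3 microseconds of actual work per call, so the threading overhead exceeds the useful computation.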

    Remedies:

    1. Use a non-threaded BLAS and LAPACK.

    2. If the parallelization is done with OpenMP, set OMP_NUM_THREADS=1, which means only one thread is used (see the example after this list).

    3. Do not use LAPACK at all: for the special 3x3 case there are specialized algorithms available, see https://en.wikipedia.org/wiki/Eigenvalue_algorithm#3.C3.973_matrices (a Fortran sketch follows below).
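
    For remedy 2, a minimal example, assuming the installed LAPACK/BLAS is threaded with OpenMP (OpenBLAS additionally honours OPENBLAS_NUM_THREADS):

    export OMP_NUM_THREADS=1
    ./test

    For remedy 3, the following is a minimal sketch of the trigonometric algorithm from the linked Wikipedia page. It computes only the eigenvalues (not the eigenvectors) of a real symmetric 3x3 matrix; the subroutine name and interface are illustrative, not taken from the original code:

    subroutine symeig3(a, eig)
        ! Eigenvalues of a real symmetric 3x3 matrix via the trigonometric
        ! formula; eigenvectors are not computed here.
        implicit none
        real(8), intent(in)  :: a(3,3)
        real(8), intent(out) :: eig(3)   ! descending order (unsorted if a is diagonal)
        real(8) :: p1, p2, p, q, r, phi
        real(8) :: b(3,3), identity(3,3)
        real(8), parameter :: pi = 3.141592653589793d0
        integer :: i

        p1 = a(1,2)**2 + a(1,3)**2 + a(2,3)**2
        if (p1 == 0.d0) then
            eig = [a(1,1), a(2,2), a(3,3)]           ! a is already diagonal
        else
            identity = 0.d0
            do i = 1, 3
                identity(i,i) = 1.d0
            end do
            q   = (a(1,1) + a(2,2) + a(3,3)) / 3.d0  ! trace(a)/3
            p2  = (a(1,1)-q)**2 + (a(2,2)-q)**2 + (a(3,3)-q)**2 + 2.d0*p1
            p   = sqrt(p2 / 6.d0)
            b   = (a - q*identity) / p
            r   = det3(b) / 2.d0
            r   = max(-1.d0, min(1.d0, r))           ! clamp against round-off
            phi = acos(r) / 3.d0
            eig(1) = q + 2.d0*p*cos(phi)                   ! largest
            eig(3) = q + 2.d0*p*cos(phi + 2.d0*pi/3.d0)    ! smallest
            eig(2) = 3.d0*q - eig(1) - eig(3)              ! from the trace
        end if
    contains
        pure function det3(m) result(d)
            real(8), intent(in) :: m(3,3)
            real(8) :: d
            d = m(1,1)*(m(2,2)*m(3,3) - m(2,3)*m(3,2)) &
              - m(1,2)*(m(2,1)*m(3,3) - m(2,3)*m(3,1)) &
              + m(1,3)*(m(2,1)*m(3,2) - m(2,2)*m(3,1))
        end function det3
    end subroutine symeig3

    This avoids the LAPACK call (and hence any threading) inside the hot loop entirely; if eigenvectors are needed as well, the same Wikipedia article outlines how to obtain them once the eigenvalues are known.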
