LAPACK: Are operations on packed storage matrices faster?


Problem Description

I want to tridiagonalize a real symmetric matrix using Fortran and LAPACK. LAPACK basically provides two routines, one operating on the full matrix, the other on the matrix in packed storage. While the latter surely uses less memory, I was wondering if anything can be said about the speed difference?

Solution

It's an empirical question, of course: but in general, nothing comes for free, and less memory/more runtime is a pretty common tradeoff.

In this case, the indexing for the data is more complex in the packed case, so as you traverse the matrix, the cost of getting at your data is a little higher. (Complicating this picture is that for symmetric matrices, the LAPACK routines also assume a certain kind of packing - that you only have the upper or lower triangle of the matrix available).
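To make the indexing difference concrete, here is a short Python sketch (not from the original answer; the matrix and size are arbitrary) of the column-major upper-triangle ('U') packing that sspevd expects, the same layout the pack loop in the Fortran code below produces: in 1-based terms, AP(i + (j-1)*j/2) = A(i,j) for i <= j.

```python
import numpy as np

# Build a small symmetric matrix and pack its upper triangle column
# by column - the 'U' layout LAPACK's packed routines expect.
# 0-based equivalent of AP(i + (j-1)*j/2) = A(i,j), i <= j.
n = 4
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n)).astype(np.float32)
a = a + a.T  # symmetrize

ap = np.empty(n * (n + 1) // 2, dtype=np.float32)
for j in range(n):          # 0-based column index
    for i in range(j + 1):  # rows 0..j of the upper triangle
        ap[i + j * (j + 1) // 2] = a[i, j]

# Round-trip: every upper-triangle entry is recoverable from AP,
# but each access pays for the extra index arithmetic.
for j in range(n):
    for i in range(j + 1):
        assert ap[i + j * (j + 1) // 2] == a[i, j]
print("packed length:", ap.size)  # n*(n+1)//2 floats instead of n*n
```

The storage saving is just under a factor of two; the multiply and divide in every index computation is where the extra traversal cost comes from.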

I was messing around with an eigenproblem earlier today, so I'll use that as a measurement benchmark; trying a simple symmetric test case (the Herdon matrix, from http://people.sc.fsu.edu/~jburkardt/m_src/test_mat/test_mat.html ), and comparing ssyevd with sspevd:

$ ./eigen2 500
 Generating a Herdon matrix: 
 Unpacked array:
 Eigenvalues L_infty err =   1.7881393E-06
 Packed array:
 Eigenvalues L_infty err =   3.0994415E-06
 Packed time:   2.800000086426735E-002
 Unpacked time:   2.500000037252903E-002

$ ./eigen2 1000
 Generating a Herdon matrix: 
 Unpacked array:
 Eigenvalues L_infty err =   4.5299530E-06
 Packed array:
 Eigenvalues L_infty err =   5.8412552E-06
 Packed time:   0.193900004029274     
 Unpacked time:   0.165000006556511  

$ ./eigen2 2500
 Generating a Herdon matrix: 
 Unpacked array:
 Eigenvalues L_infty err =   6.1988831E-06
 Packed array:
 Eigenvalues L_infty err =   8.4638596E-06
 Packed time:    3.21040010452271     
 Unpacked time:    2.70149993896484 

There's about an 18% difference, which I must admit is larger than I expected (also with a slightly larger error for the packed case?). This is with Intel's MKL. The performance difference will depend on your matrix in general, of course, as eriktous points out, and on the problem you're doing; the more random access to the matrix you have to do, the worse the overhead would be. The code I used is as follows:

program eigens
      implicit none

      integer :: nargs,n  ! problem size 
      real, dimension(:,:), allocatable :: A, B, Z
      real, dimension(:), allocatable :: PA
      real, dimension(:), allocatable :: work
      integer, dimension(:), allocatable :: iwork
      real, dimension(:), allocatable :: eigenvals, expected
      real :: c, p
      integer :: worksize, iworksize
      character(len=100) :: nstr
      integer :: unpackedclock, packedclock 
      double precision :: unpackedtime, packedtime
      integer :: i,j,info

! get filename
      nargs = command_argument_count()
      if (nargs /= 1) then
          print *,'Usage: eigen2 n'
          print *,'       Where n = size of array'
          stop
      endif
      call get_command_argument(1, nstr)
      read(nstr,*) n
      if (n < 4 .or. n > 25000) then
          print *, 'Invalid n ', nstr
          stop
      endif


! Initialize local arrays    

      allocate(A(n,n),B(n,n))
      allocate(eigenvals(n)) 

! calculate the matrix - unpacked

      print *, 'Generating a Herdon matrix: '

      A = 0.
      c = (1.*n * (1.*n + 1.) * (2.*n - 5.))/6.
      forall (i=1:n-1,j=1:n-1)
        A(i,j) = -1.*i*j/c
      endforall
      forall (i=1:n-1)
        A(i,i) = (c - 1.*i*i)/c
        A(i,n) = 1.*i/c
      endforall
      forall (j=1:n-1)
        A(n,j) = 1.*j/c
      endforall
      A(n,n) = -1./c
      B = A

      ! expected eigenvalues
      allocate(expected(n))
      p = 3. + sqrt((4. * n - 3.) * (n - 1.)*3./(n+1.))
      expected(1) = p/(n*(5.-2.*n))
      expected(2) = 6./(p*(n+1.))
      expected(3:n) = 1.

      print *, 'Unpacked array:'
      allocate(work(1),iwork(1))
      call ssyevd('N','U',n,A,n,eigenvals,work,-1,iwork,-1,info)
      worksize = int(work(1))
      iworksize = iwork(1)
      deallocate(work,iwork)
      allocate(work(worksize),iwork(iworksize))

      call tick(unpackedclock)
      call ssyevd('N','U',n,A,n,eigenvals,work,worksize,iwork,iworksize,info)
      unpackedtime = tock(unpackedclock)
      deallocate(work,iwork)

      if (info /= 0) then
           print *, 'Error -- info = ', info
      endif
      print *,'Eigenvalues L_infty err = ', maxval(eigenvals-expected)


      ! pack array

      print *, 'Packed array:'
      allocate(PA(n*(n+1)/2))
      allocate(Z(n,n))
      do i=1,n 
        do j=i,n
           PA(i+(j-1)*j/2) = B(i,j)
        enddo
      enddo

      allocate(work(1),iwork(1))
      call sspevd('N','U',n,PA,eigenvals,Z,n,work,-1,iwork,-1,info)
      worksize = int(work(1))
      iworksize = iwork(1)
      deallocate(work,iwork)
      allocate(work(worksize),iwork(iworksize))

      call tick(packedclock)
      call sspevd('N','U',n,PA,eigenvals,Z,n,work,worksize,iwork,iworksize,info)
      packedtime = tock(packedclock)
      deallocate(work,iwork)
      deallocate(Z,A,B,PA)

      if (info /= 0) then
           print *, 'Error -- info = ', info
      endif
      print *,'Eigenvalues L_infty err = ', &
      maxval(eigenvals-expected)

      deallocate(eigenvals, expected)


      print *,'Packed time: ', packedtime
      print *,'Unpacked time: ', unpackedtime


contains
    subroutine tick(t)
        integer, intent(OUT) :: t

        call system_clock(t)
    end subroutine tick

    ! returns time in seconds from now to time described by t
    real function tock(t)
        integer, intent(in) :: t
        integer :: now, clock_rate

        call system_clock(now,clock_rate)

        tock = real(now - t)/real(clock_rate)
    end function tock

end program eigens
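As an independent cross-check of the Herdon matrix construction and the closed-form eigenvalues used above, here is a small NumPy sketch (Python rather than the answer's Fortran; the size n is an arbitrary choice):

```python
import numpy as np

# Rebuild the n x n Herdon matrix exactly as the Fortran code does
# (i, j are 1-based below, shifted when indexing), then compare
# numpy's eigenvalues against the closed-form spectrum in the answer.
n = 200
c = n * (n + 1.0) * (2.0 * n - 5.0) / 6.0

A = np.zeros((n, n))
for i in range(1, n):                 # i = 1 .. n-1
    for j in range(1, n):
        A[i - 1, j - 1] = -1.0 * i * j / c
for i in range(1, n):
    A[i - 1, i - 1] = (c - 1.0 * i * i) / c
    A[i - 1, n - 1] = 1.0 * i / c     # last column
    A[n - 1, i - 1] = 1.0 * i / c     # last row
A[n - 1, n - 1] = -1.0 / c

# Closed-form eigenvalues: two special values, the rest equal 1.
p = 3.0 + np.sqrt((4.0 * n - 3.0) * (n - 1.0) * 3.0 / (n + 1.0))
expected = np.ones(n)
expected[0] = p / (n * (5.0 - 2.0 * n))
expected[1] = 6.0 / (p * (n + 1.0))

err = np.max(np.abs(np.sort(np.linalg.eigvalsh(A)) - np.sort(expected)))
print("L_infty err =", err)
```

In double precision the error is orders of magnitude below the single-precision figures printed by the benchmark, which confirms the L_infty errors above are dominated by the real*4 arithmetic, not by the matrix construction.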
