Parallel linear algebra for a multicore system


Problem Description

I'm developing a program that needs to do heavy linear algebra calculations.

Now I'm using LAPACK/BLAS routines, but I need to exploit my machine (a 24-core Xeon X5690 system).

I've found projects like pblas and scalapack, but they all seem to focus on distributed computing and on using MPI.

I have no cluster available; all computations will be done on a single server, and using MPI looks like overkill.

Does anyone have any suggestions on this?

Recommended Answer

As @larsmans mentioned, you still use the LAPACK + BLAS interfaces; you just find a tuned, multithreaded implementation for your platform (say, MKL). MKL is great, but expensive. Other open-source options include the following (see the build-and-run sketch after the list):

  • OpenBLAS / GotoBLAS: the Nehalem support should work fine, but there is no tuned support yet for Westmere. Does multithreading very well.
  • ATLAS: automatically tunes itself to your architecture at installation time. Probably slower for "typical" matrices (e.g., square SGEMM) but can be faster for odd cases, and for Westmere it may even beat OpenBLAS/GotoBLAS; I haven't tested this myself. Mostly optimized for the serial case, but it does include parallel multithreaded routines.
  • PLASMA: a LAPACK implementation designed specifically for multicore.
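
To make the drop-in nature concrete, here is a minimal sketch (my illustration, not part of the original answer): the call below is plain CBLAS, and the same source runs multithreaded when linked against a threaded BLAS such as OpenBLAS or MKL, with the thread count controlled by an environment variable (OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, or OMP_NUM_THREADS, depending on the backend). The file name and matrix size are hypothetical.

```c
/* gemm_demo.c - the same CBLAS call runs on as many threads as the
   linked BLAS decides to use; no source changes are needed.
   Build (assuming OpenBLAS): gcc gemm_demo.c -o gemm_demo -lopenblas
   Run:   OPENBLAS_NUM_THREADS=24 ./gemm_demo */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 2000;                        /* illustrative problem size */
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    if (!A || !B || !C) return 1;

    for (size_t i = 0; i < (size_t)n * n; i++) {  /* arbitrary test data */
        A[i] = (double)(i % 100) / 100.0;
        B[i] = (double)(i % 53) / 53.0;
        C[i] = 0.0;
    }

    /* C = 1.0 * A * B + 0.0 * C; the BLAS library parallelizes internally. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0][0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}
```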

I'd also agree with Mark's comment: depending on which LAPACK routines you're using, the distributed-memory approach with MPI might actually be faster than the multithreaded one. That's unlikely to be the case with BLAS routines, but for something more complicated (say, the eigenvalue/eigenvector routines in LAPACK) it's worth testing; see the timing sketch below. While it's true that MPI function calls add overhead, doing things in distributed-memory mode means you don't have to worry so much about false sharing, synchronizing access to shared variables, and so on.
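
On the "worth testing" point, a quick way to check how well the threaded route scales before committing to MPI is to time one of the heavier LAPACK routines at different thread counts. A minimal sketch, assuming LAPACKE and OpenBLAS are installed (the file name, matrix size, and test matrix are illustrative, not from the original answer):

```c
/* eig_bench.c - time a symmetric eigensolve under a threaded BLAS/LAPACK.
   Build (assuming LAPACKE and OpenBLAS):
     gcc eig_bench.c -o eig_bench -llapacke -lopenblas */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void) {
    const int n = 1500;                        /* illustrative problem size */
    double *a = malloc(sizeof(double) * n * n);
    double *w = malloc(sizeof(double) * n);    /* eigenvalues, ascending */
    if (!a || !w) return 1;

    /* Fill a symmetric test matrix (Hilbert-like, purely for timing). */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[(size_t)i * n + j] = 1.0 / (1.0 + i + j);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* 'V' = compute eigenvectors too, 'U' = upper triangle is stored. */
    lapack_int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', n, a, n, w);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("dsyev: info=%d  lowest eigenvalue=%g  time=%.3f s\n",
           (int)info, w[0], secs);
    free(a); free(w);
    return 0;
}
```

Running it as OPENBLAS_NUM_THREADS=1 ./eig_bench and again with OPENBLAS_NUM_THREADS=24 shows the actual scaling; if the speedup flattens out well short of the core count, the ScaLAPACK/MPI route may be worth the extra plumbing.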
