Python 3: Parallel diagonalization of multiple matrices

Question

I am trying to improve the performance of some code of mine that first constructs a 4x4 matrix depending on two indices, diagonalizes this matrix, and then stores the eigenvectors of each diagonalization of each matrix in a 4-dimensional array. At the moment I am just going through all the indices serially and then storing the eigenvectors in their place in the 4-dimensional array. Now, I am wondering if it is possible to parallelize this a little bit by using threading or something similar, such that each thread would diagonalize one matrix and then store it in its place. The problem I have is: what are my limitations in doing this? Would I run into problems when different threads want to write into the resulting 4-dim. array at the same time, and do I have to use a lock to prevent this? I am sorry if this question is trivial, but by searching I was not able to find anything related, and my knowledge about threading is very limited. A minimal example would be:

import numpy as np
from numpy.linalg import eigh as eigh2

L = 8  # example value; in the real code L is defined elsewhere

spectrum = np.zeros([L // 2, L // 2, 4, 4], complex)
for i in range(0, L // 2):
    for j in range(0, L // 2):
        k = [-(2 * i * 2 * np.pi / L), -(2 * j * 2 * np.pi / L)]  # would feed the real H(k)
        H = np.ones([4, 4], complex)  # constant placeholder matrix
        energies, states = eigh2(H)
        spectrum[i, j, :, :] = states

Note that, for the sake of brevity, I have replaced the function that constructs the matrix depending on k with a constant matrix.

I would really appreciate any help or pointers to resources on how I could implement some parallelization. Is threading a realistic way of improving the performance?

Answer

The short answer is that yes, you probably need locks—but if you can reorganize your problem, that may be a lot better than locking.

The long answer is a bit involved, especially since I don't know how much you already know.

In general, threading doesn't do much good in CPython for CPU-bound code, because of the Global Interpreter Lock, which prevents any threads from interpreting a line (actually, bytecode) of Python if another thread is in the middle of doing so. However, NumPy has code that specifically releases the GIL in certain places to allow threading to work better, so if you're CPU-bound within low-level NumPy algorithms, threading actually can work. The docs are not always clear about which functions do this and which don't, so you may have to test it yourself just to find out if parallelizing will help here. (A quick&dirty way to do this is to hack up a version of your code that just does the computations without storing them anywhere, run it across N threads, and see how many cores are busy while you do it.)
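
To make that concrete, here is a minimal sketch of that quick-and-dirty check, under the assumption that a constant 4x4 matrix and an example size L stand in for the real problem: a few threads call eigh repeatedly without storing anything, and you watch a CPU monitor to see how many cores are busy.

import threading
import numpy as np
from numpy.linalg import eigh

L = 64          # example problem size, not taken from the question
N_THREADS = 4   # number of worker threads to try

def crunch(n_rows):
    # Diagonalize throwaway matrices; results are deliberately discarded.
    H = np.ones([4, 4], complex)   # stand-in for the real H(k)
    for _ in range(n_rows):
        for _ in range(L // 2):
            eigh(H)

rows_per_thread = (L // 2) // N_THREADS
threads = [threading.Thread(target=crunch, args=(rows_per_thread,))
           for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

If more than one core stays busy while this runs, eigh is releasing the GIL for you and threading has a chance of helping.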

Now, in general, in CPython, locks aren't necessary around certain kinds of operations, including __setitem__ on simple types—but that's because of that same GIL, so it isn't going to help you here. If you have multiple operations all trying to write to the same array, they will need a lock around that array.
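
As a rough sketch of what that locking could look like for your example (the per-element worker layout is an assumption, and the constant H again stands in for the real matrix):

import threading
import numpy as np
from numpy.linalg import eigh

L = 8  # example value
spectrum = np.zeros([L // 2, L // 2, 4, 4], complex)
lock = threading.Lock()

def worker(i, j):
    H = np.ones([4, 4], complex)       # stand-in for the real H(k)
    energies, states = eigh(H)
    with lock:                         # serialize writes to the shared array
        spectrum[i, j, :, :] = states

threads = [threading.Thread(target=worker, args=(i, j))
           for i in range(L // 2) for j in range(L // 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()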

But there may be a better way around this. If you can find a way to divide the array into smaller arrays, only one of which is being modified at any given time, you don't need any locks. Or, if you can have the threads return smaller arrays that can be assembled by a single master thread into the final answer, instead of working in-place in the first place, that also works.
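
One way to sketch that second idea, where workers return their small result and only the main thread touches the big array, is with concurrent.futures (L and the constant H are again illustrative placeholders):

from concurrent.futures import ThreadPoolExecutor
import numpy as np
from numpy.linalg import eigh

L = 8  # example value

def diagonalize(i, j):
    H = np.ones([4, 4], complex)       # stand-in for the real H(k)
    energies, states = eigh(H)
    return i, j, states                # return the result instead of writing in-place

spectrum = np.zeros([L // 2, L // 2, 4, 4], complex)
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(diagonalize, i, j)
               for i in range(L // 2) for j in range(L // 2)]
    for future in futures:
        i, j, states = future.result()
        spectrum[i, j, :, :] = states  # only the main thread writes here

Because only the main thread writes into spectrum, no lock is needed in this variant.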

But before you go doing that… in some cases, NumPy (or, rather, one of the libraries it's using) is already auto-parallelizing things for you, or could be if you built it differently. Or it could be SIMD-vectorizing things in a way that actually gives more speedup than threading, which you could end up breaking. And so on.

So, make sure you have a properly-optimized NumPy with all the optional prereqs installed before you try anything. Then make sure it's only using one core as-is. Then build a test scaffolding so you can compare different implementations. And then you can try out each lock-based, non-sharing, and non-mutating algorithm you can come up with to see if the parallelism helps more than the extra stuff hurts.
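
The scaffolding does not have to be elaborate; a small timing helper like the following is enough to compare candidates on equal footing (run_serial and run_threaded are hypothetical names for whatever variants you end up writing):

import time

def benchmark(fn, repeats=3):
    # Return the best wall-clock time over a few runs of fn().
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Example usage, once the candidate implementations exist:
# for name, fn in [("serial", run_serial), ("threaded", run_threaded)]:
#     print(name, benchmark(fn))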
