MPI-3 Shared Memory for Array Struct

Problem Description

I have a simple C++ struct that basically wraps a standard C array:

struct MyArray {
    T* data;
    int length;
    // ...
};

where T is a numeric type like float or double. length is the number of elements in the array. Typically my arrays are very large (tens of thousands up to tens of millions of elements).

I have an MPI program where I would like to expose two instances of MyArray, say a_old and a_new, as shared memory objects via MPI 3 shared memory. The context is that each MPI rank reads from a_old. Then, each MPI rank writes to certain indices of a_new (each rank only writes to its own set of indices - no overlap). Finally, a_old = a_new must be set on all ranks. a_old and a_new are the same size. Right now I'm making my code work by syncing (Isend/Irecv) each rank's updated values with other ranks. However, due to the data access pattern, there's no reason I need to incur the overhead of message passing and could instead have one shared memory object and just put a barrier before a_old = a_new. I think this would give me better performance (though please correct me if I'm wrong).
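
To make the intended pattern concrete, here is a rough sketch of one iteration; my_indices, n_my_indices and compute() are just hypothetical placeholders:

/* each rank reads freely from a_old, writes only its own indices */
for (int k = 0; k < n_my_indices; k++) {
    int i = my_indices[k];
    a_new.data[i] = compute(a_old.data, i);
}
MPI_Barrier(MPI_COMM_WORLD);   /* every rank has finished writing a_new */
/* ... then a_old = a_new on all ranks ... */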

I have had trouble finding complete code examples of doing shared memory with MPI 3. Most sites only provide reference documentation or incomplete snippets. Could someone walk me through a simple and complete code example that does the sort of thing I'm trying to achieve (updating and syncing a numeric array via MPI shared memory)? I understand the main concepts of creating shared memory communicators and windows, setting fences, etc., but it would really help my understanding to see one example that puts it all together.
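
For reference, my understanding is that the shared-memory communicator would be obtained with MPI_Comm_split_type, along these lines (a sketch, possibly not exactly right):

MPI_Comm shmcomm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &shmcomm);
/* shmcomm contains the ranks that can share memory; on a single
   node it spans the same processes as MPI_COMM_WORLD */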

Also, I should mention that I'll only be running my code on one node, so I don't need to worry about needing multiple copies of my shared-memory object across nodes; I just need one copy of my data for the single node on which my MPI processes are running. Despite this, other solutions like OpenMP aren't feasible for me in this case, since I have a ton of MPI code and can't rewrite everything for the sake of one or two arrays I'd like to share.

Answer

Using shared memory with MPI-3 is relatively simple.

First, you allocate the shared memory window using MPI_Win_allocate_shared:

MPI_Win win;
MPI_Aint size;
void *baseptr;

/* all ranks are on one node (as in your case), so MPI_COMM_WORLD
   can be used directly */
if (rank == 0)
{
   size = 2 * ARRAY_LEN * sizeof(T);
   MPI_Win_allocate_shared(size, sizeof(T), MPI_INFO_NULL,
                           MPI_COMM_WORLD, &baseptr, &win);
}
else
{
   int disp_unit;
   MPI_Win_allocate_shared(0, sizeof(T), MPI_INFO_NULL,
                           MPI_COMM_WORLD, &baseptr, &win);
   MPI_Win_shared_query(win, 0, &size, &disp_unit, &baseptr);
}
a_old.data = (T *)baseptr;            /* explicit cast needed in C++ */
a_old.length = ARRAY_LEN;
a_new.data = a_old.data + ARRAY_LEN;  /* second half of the same block */
a_new.length = ARRAY_LEN;

Here, only rank 0 allocates memory. It doesn't really matter which process allocates it, as the memory is shared. It is even possible to have each process allocate a portion of the memory, but since by default the allocation is contiguous, both methods are equivalent. All other processes then use MPI_Win_shared_query to find out where in their virtual address space the shared memory block begins. That address might vary among the ranks, and therefore one should not pass around absolute pointers.
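
To illustrate the per-rank alternative just mentioned, a minimal sketch (float elements are an assumption; the single-node use of MPI_COMM_WORLD is as above):

MPI_Win win;
float *mybase;

/* every rank contributes ARRAY_LEN elements; with the default
   alloc_shared_noncontig=false, the segments are laid out
   contiguously in rank order */
MPI_Win_allocate_shared(ARRAY_LEN * sizeof(float), sizeof(float),
                        MPI_INFO_NULL, MPI_COMM_WORLD,
                        &mybase, &win);

/* rank r's slice then starts at base0 + r * ARRAY_LEN */
MPI_Aint sz;
int disp_unit;
float *base0;
MPI_Win_shared_query(win, 0, &sz, &disp_unit, &base0);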

You can now simply load from a_old.data and store into a_new.data. As the ranks in your case work on disjoint sets of memory locations, you don't really need to lock the window. Use window locks to implement e.g. protected initialisation of a_old or other operations that require synchronisation. You might also need to explicitly tell the compiler not to reorder the code and to emit a memory fence, so that all outstanding load/store operations have finished before, e.g., you call MPI_Barrier().
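
One portable way to obtain those guarantees is the MPI_Win_sync idiom inside a passive-target epoch; a sketch (shared-memory windows use the unified memory model, so MPI_Win_sync acts as a plain memory barrier here):

MPI_Win_lock_all(MPI_MODE_NOCHECK, win);  /* open a passive-target epoch */

/* ... each rank stores into its own indices of a_new.data ... */

MPI_Win_sync(win);              /* memory barrier: complete local stores  */
MPI_Barrier(MPI_COMM_WORLD);    /* synchronise the ranks                  */
MPI_Win_sync(win);              /* memory barrier: observe others' stores */

/* ... every rank can now read what the others wrote ... */

MPI_Win_unlock_all(win);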

The a_old = a_new code suggests copying one array onto the other. Instead, you could simply swap the data pointers and, if needed, the size fields. Since only the array's data lives in the shared memory block, swapping the pointers is a local operation, i.e. no synchronisation is needed. Assuming that both arrays are of equal length:

T *temp;
temp = a_old.data;
a_old.data = a_new.data;
a_new.data = temp;

You still need a barrier to make sure that all other processes have finished processing before continuing further.
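
Putting the pieces together, the hand-over at the end of one iteration could look like this sketch:

/* all ranks have finished writing their part of a_new.data */
MPI_Barrier(MPI_COMM_WORLD);

T *temp = a_old.data;      /* purely local pointer swap, */
a_old.data = a_new.data;   /* done independently by      */
a_new.data = temp;         /* every rank                 */

/* next iteration: read from a_old, write to a_new as before */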

At the very end, simply free the window:

MPI_Win_free(&win);

A complete example (in C) follows:

#include <stdio.h>
#include <mpi.h>

#define ARRAY_LEN 1000

int main (void)
{
   MPI_Init(NULL, NULL);

   int rank, nproc;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &nproc);

   MPI_Win win;
   MPI_Aint size;
   void *baseptr;

   if (rank == 0)
   {
      /* rank 0 allocates the whole block; the array holds ints below */
      size = ARRAY_LEN * sizeof(int);
      MPI_Win_allocate_shared(size, sizeof(int), MPI_INFO_NULL,
                              MPI_COMM_WORLD, &baseptr, &win);
   }
   else
   {
      /* the other ranks allocate nothing and query rank 0's pointer */
      int disp_unit;
      MPI_Win_allocate_shared(0, sizeof(int), MPI_INFO_NULL,
                              MPI_COMM_WORLD, &baseptr, &win);
      MPI_Win_shared_query(win, 0, &size, &disp_unit, &baseptr);
   }

   printf("Rank %d, baseptr = %p\n", rank, baseptr);

   /* each rank writes a disjoint, strided set of indices */
   int *arr = baseptr;
   for (int i = rank; i < ARRAY_LEN; i += nproc)
     arr[i] = rank;

   MPI_Barrier(MPI_COMM_WORLD);  /* all writes done (see fence caveat above) */

   if (rank == 0)
   {
      for (int i = 0; i < 10; i++)
         printf("%4d", arr[i]);
      printf("\n");
   }

   MPI_Win_free(&win);

   MPI_Finalize();
   return 0;
}
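
For reference, compiling and running the example with four ranks (the file name shared.c is an assumption) should print the owning rank of each of the first ten indices, after four baseptr lines whose addresses vary from run to run:

$ mpicc -std=c99 shared.c -o shared
$ mpiexec -n 4 ./shared
...
   0   1   2   3   0   1   2   3   0   1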

Disclaimer: Take this with a grain of salt. My understanding of MPI's RMA is still quite weak.
