CyclicDist goes slower on multiple locales

Problem description

I tried implementing matrix multiplication using the CyclicDist module.

When I test with one locale versus two locales, the one-locale run is much faster. Is that because communication between the two Jetson Nano boards is very expensive, or is my implementation not taking advantage of the way CyclicDist works?

Here is my code:

use Random, Time, CyclicDist;

// Time the whole run, including array setup
var t : Timer;
t.start();

config const size = 10;
const Space = {1..size, 1..size};

// Three cyclically distributed size x size arrays
const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid: [gridSpace] real;
fillRandom(grid);
const gridSpace2 = Space dmapped Cyclic(startIdx=Space.low);
var grid2: [gridSpace2] real;
fillRandom(grid2);
const gridSpace3 = Space dmapped Cyclic(startIdx=Space.low);
var grid3: [gridSpace3] real;

// grid3 = grid * grid2; the k loop is serial so that the
// updates to grid3[i,j] do not race with one another
forall i in 1..size do {
    forall j in 1..size do {
        for k in 1..size do {
            grid3[i,j] += grid[i,k] * grid2[k,j];
        }
    }
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(), " seconds");
writeln("Size of matrix was: ", size);
t.clear();

I know my implementation is not optimal for distributed-memory systems.

Recommended answer

Probably the main reason this program does not scale is that the computation never uses any locale other than the initial one. Specifically, forall loops over ranges, like the ones in your code:

forall i in 1..size do

always run all of their iterations using tasks executing on the current locale. This is because ranges are not distributed values in Chapel, so their parallel iterators don't distribute work across locales. Consequently, all size**3 executions of the loop body:

grid3[i,j] += grid[i,k] * grid2[k,j];

will run on locale 0, and none of them will run on locale 1. You can confirm this by putting the following into the innermost loop's body:

writeln("locale ", here.id, " running ", (i,j,k));

(where here.id is the ID of the locale on which the current task is running). This will show that locale 0 is running all iterations:

locale 0 running (9, 1, 1)
locale 0 running (1, 1, 1)
locale 0 running (1, 1, 2)
locale 0 running (9, 1, 2)
locale 0 running (1, 1, 3)
locale 0 running (9, 1, 3)
locale 0 running (1, 1, 4)
locale 0 running (1, 1, 5)
locale 0 running (1, 1, 6)
locale 0 running (1, 1, 7)
locale 0 running (1, 1, 8)
locale 0 running (1, 1, 9)
locale 0 running (6, 1, 1)
...

Contrast this with running a forall loop over a distributed domain like gridSpace:

forall (i,j) in gridSpace do
  writeln("locale ", here.id, " running ", (i,j));

where the iterations will be distributed between the locales:

locale 0 running (1, 1)
locale 0 running (9, 1)
locale 0 running (1, 2)
locale 0 running (9, 2)
locale 0 running (1, 3)
locale 0 running (9, 3)
locale 0 running (1, 4)
locale 1 running (8, 1)
locale 1 running (10, 1)
locale 1 running (8, 2)
locale 1 running (2, 1)
locale 1 running (8, 3)
locale 1 running (10, 2)
...
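
The contrast between the two kinds of loop can also be checked without reading through interleaved output. The following is a minimal sketch, not part of the original answer: it tallies, per locale, how many iterations each loop runs. The names rangeCount and domainCount are hypothetical, and the sketch keeps the same Cyclic spelling used in the question (newer Chapel releases spell the distribution cyclicDist, so the declaration may need adjusting there).

```chapel
use CyclicDist;

config const size = 10;
const Space = {1..size, 1..size};
const gridSpace = Space dmapped Cyclic(startIdx=Space.low);

// One atomic counter per locale for each loop style (hypothetical names)
var rangeCount: [LocaleSpace] atomic int;
var domainCount: [LocaleSpace] atomic int;

// forall over a range: tasks stay on the locale that started the loop
forall i in 1..size do
  for j in 1..size do
    rangeCount[here.id].add(1);

// forall over a distributed domain: each index runs on its owning locale
forall (i,j) in gridSpace do
  domainCount[here.id].add(1);

for loc in LocaleSpace do
  writeln("locale ", loc, ": range loop ran ", rangeCount[loc].read(),
          " iterations, domain loop ran ", domainCount[loc].read());
```

When run on more than one locale, the range loop should report every iteration on locale 0, while the domain loop's iterations should be spread across the locales according to the cyclic distribution.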

Since all of the computation runs on locale 0 but half of the data is located on locale 1 (because the arrays are distributed), a lot of communication is generated to fetch remote values from locale 1's memory into locale 0's in order to compute on them.
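
One way to act on this, sketched here under assumptions rather than taken from the original answer, is to drive the outer loop with the distributed domain so that each (i,j) iteration runs on the locale that owns grid3[i,j]. The sketch shares a single domain among the three arrays and uses a hypothetical local accumulator named acc. It keeps the Cyclic spelling from the question, which newer Chapel releases rename to cyclicDist.

```chapel
use Random, CyclicDist;

config const size = 10;
const Space = {1..size, 1..size};

// All three arrays share one cyclically distributed domain
const gridSpace = Space dmapped Cyclic(startIdx=Space.low);
var grid: [gridSpace] real;
var grid2: [gridSpace] real;
var grid3: [gridSpace] real;
fillRandom(grid);
fillRandom(grid2);

// Each (i,j) iteration executes on the locale that owns grid3[i,j]
forall (i,j) in gridSpace {
  var acc = 0.0;              // per-iteration accumulator, local to the task
  for k in 1..size do         // serial, so there is no race on the result
    acc += grid[i,k] * grid2[k,j];
  grid3[i,j] = acc;           // a single write to locally owned memory
}
```

This only makes the writes to grid3 local and spreads the work across locales. Reads of grid[i,k] and grid2[k,j] can still be remote, so whether it actually runs faster on two Jetson Nano boards would need to be measured.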
