的&QUOT最佳途径;遍历一个二维数组&QUOT ;,使用Repa [英] Best way of "looping over a 2-D array", using Repa

查看:308
本文介绍了的&QUOT最佳途径;遍历一个二维数组&QUOT ;,使用Repa的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我找到阵列库Repa Haskell的很有意思,想使一个简单的程序,试图了解如何使用它。我也使用列表,这被证明是更快做了一个简单的实现。我的主要问题是,我怎么能改善以下Repa code,使其成为最有效的(希望也非常可读的)。我很新的使用哈斯克尔,我无法找到Repa [修改有一个在的哈斯克尔维基,我不知怎么忘了,当我写这篇],所以不要以为我知道任何事情。 :)比如,我不知道什么时候使用武力或d​​eepSeqArray。

I find the array library Repa for Haskell very interesting, and wanted to make a simple program, to try to understand how to use it. I also made a simple implementation using lists, which proved to be much faster. My main question is how I could improve the Repa code below to make it the most efficient (and hopefully also very readable). I am quite new using Haskell, and I couldn't find any easily understandable tutorial on Repa [edit there is one at the Haskell Wiki, that I somehow forgot when I wrote this], so don't assume I know anything. :) For example, I'm not sure when to use force or deepSeqArray.

的程序是用来大致计算球体的以下列方式的体积

The program is used to approximately calculate the volume of a sphere in the following way:


  1. 的中心点和球体的半径被指定,以及一个长方体内等距隔开的坐标,这是众所周知的涵盖范围。

  2. 该方案需要每一个坐标,计算与球体的中心的距离,并且如果它是大于球体的半径较小,它是用来加起来球体的总(近似)体积。

两个版本如下所示,一个使用列表和一个使用repa。我知道code是低效的,尤其是对于这种使用情况,但这个想法是让它变得更复杂以后。

Two versions are shown below, one using lists and one using repa. I know the code is inefficient, especially for this use case, but the idea is to make it more complicated later on.

对于下面的值,并编制GHC -Odph -fllvm -fforce-recomp -rtsopts -threaded列表中版本带有1.4 S,而repa版本需要12 s的RTS +和-N1 10秒以+ RTS -N2,虽然没有火花转换(我有一个双核的Intel计算机运行的是Windows 7 64,GHC 7.0.2和2.8 LLVM(酷睿2 E7400 @ 2.8千兆赫))。 (注释掉以下主要的正确路线只运行一个版本。)

For the values below, and compiling with "ghc -Odph -fllvm -fforce-recomp -rtsopts -threaded", the list version takes 1.4 s, while the repa version takes 12 s with +RTS -N1 and 10 s with +RTS -N2, though no sparks are converted (I have a dual-core Intel machine (Core 2 Duo E7400 @ 2.8 GHz) running Windows 7 64, GHC 7.0.2 and llvm 2.8). (Comment out the correct line in main below to just run one of the versions.)

感谢您的帮助!

import Data.Array.Repa as R
import qualified Data.Vector.Unboxed as V
import Prelude as P

-- Calculate the volume of a sphere by putting it in a bath of coordinates. Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to find an approximate volume.


particles = [(0,0,0)] -- used for the list alternative --[(0,0,0),(0,2,0)]
particles_repa = [0,0,0::Double] -- used for the repa alternative, can currently just be one coordinate

-- Radius of the particle
a = 4

-- Generate the coordinates. Could this be done more efficiently, and at the same time simple? In Matlab I would use ndgrid.
step = 0.1 --0.05
xrange = [-10,-10+step..10] :: [Double]
yrange = [-10,-10+step..10]
zrange = [-10,-10+step..10]

-- All coordinates as triples. These are used directly in the list version below.
coords = [(x,y,z)  | x <- xrange, y <- yrange, z <- zrange]

---- List code ----

volumeIndividuals = fromIntegral (length particles) * 4*pi*a**3/3

volumeInside = step**3 * fromIntegral (numberInsideParticles particles coords)

numberInsideParticles particles coords = length $ filter (==True) $ P.map (insideParticles particles) coords

insideParticles particles coord =  any (==True) $ P.map (insideParticle coord) particles

insideParticle (xc,yc,zc) (xp,yp,zp) = ((xc-xp)^2+(yc-yp)^2+(zc-zp)^2) < a**2
---- End list code ----

---- Repa code ----

-- Put the coordinates in a Nx3 array.
xcoords = P.map (\(x,_,_) -> x) coords
ycoords = P.map (\(_,y,_) -> y) coords
zcoords = P.map (\(_,_,z) -> z) coords

-- Total number of coordinates
num_coords = (length xcoords) ::Int

xcoords_r = fromList (Z :. num_coords :. (1::Int)) xcoords
ycoords_r = fromList (Z :. num_coords :. (1::Int)) ycoords
zcoords_r = fromList (Z :. num_coords :. (1::Int)) zcoords

rcoords = xcoords_r R.++ ycoords_r R.++ zcoords_r

-- Put the particle coordinates in an array, then extend (replicate) this array so that its size becomes the same as that of rcoords
particle = fromList (Z :. (1::Int) :. (3::Int)) particles_repa
particle_slice = slice particle (Z :. (0::Int) :. All)
particle_extended = extend (Z :. num_coords :. All) particle_slice

-- Calculate the squared difference between the (x,y,z) coordinates of the particle and the coordinates of the cuboid.
squared_diff = deepSeqArrays [rcoords,particle_extended] ((force2 rcoords) -^ (force2 particle_extended)) **^ 2
(**^) arr pow = R.map (**pow) arr

xslice = slice squared_diff (Z :. All :. (0::Int))
yslice = slice squared_diff (Z :. All :. (1::Int))
zslice = slice squared_diff (Z :. All :. (2::Int))

-- Calculate the distance between each coordinate and the particle center
sum_squared_diff = [xslice,yslice,zslice] `deepSeqArrays` xslice +^ yslice +^ zslice

-- Do the rest using vector, since I didn't get the repa variant working.
ssd_vec = toVector sum_squared_diff

-- Determine the number of the coordinates that are within the particle (instead of taking the square root to get the distances above, I compare to the square of the radius here, to improve performance)
total_within = fromIntegral (V.length $ V.filter (<a**2) ssd_vec)
--total_within = foldAll (\x acc -> if x < a**2 then acc+1 else acc) 0 sum_squared_diff

-- Finally, calculate an approximation of the volume of the sphere by taking the volume of the cubes with side step, multiplied with the number of coordinates within the sphere.
volumeInside_repa = step**3 * total_within 

-- Helper function that shows the size of a 2-D array.
rsize = reverse . listOfShape . (extent :: Array DIM2 Double -> DIM2)

---- End repa code ----

-- Comment out the list or the repa version if you want to time the calculations separately.
main = do
    putStrLn $ "Step = " P.++ show step
    putStrLn $ "Volume of individual particles = " P.++ show volumeIndividuals
    putStrLn $ "Volume of cubes inside particles (list) = " P.++ show volumeInside
    putStrLn $ "Volume of cubes inside particles (repa) = " P.++ show volumeInside_repa

修改:一些背景解释了为什么我写了code,因为它是上面:

Edit: Some background that explains why I have written the code as it is above:

我大多写在Matlab code和我的的性能提升体验从该地区来居多。在Matlab中,通常希望使用矩阵上直接操作功能,提高性能,使您的计算。我的上述问题的实施方式中,在Matlab R2010b中,需要使用如下所示的矩阵版本使用嵌套循环0.9秒,和15秒。虽然我知道Haskell是从MATLAB很大的不同,我的希望是,使用列表使用Repa阵列哈斯克尔将提高code的性能去。从lists-> Repa arrays->载体的转换是有,因为我没有足够的技术更好的东西来替代它们。这就是为什么我要求输入。 :)上面的时序数因此主观的,因为它可以测量的我的的性能比语言能力的多了,但对我来说是一个有效的指标,现在,因为什么什么决定我会使用取决于如果的 I 的可使其工作或没有。

I mostly write code in Matlab, and my experience of performance improvement comes mostly from that area. In Matlab, you usually want to make your calculations using functions operating on matrices directly, to improve performance. My implementation of the problem above, in Matlab R2010b, takes 0.9 seconds using the matrix version shown below, and 15 seconds using nested loops. Although I know Haskell is very different from Matlab, my hope was that going from using lists to using Repa arrays in Haskell would improve the performance of the code. The conversions from lists->Repa arrays->vectors are there because I'm not skilled enough to replace them with something better. This is why I ask for input. :) The timing numbers above is thus subjective, since it may measure my performance more than that of the abilities of the languages, but it is a valid metric for me right now, since what decides what I will use depends on if I can make it work or not.

TL;博士:我明白,我Repa code以上可能是愚蠢的或病理性的,但它是我现在所能做的最好。我很想能写出更好的哈斯克尔code,我希望你能帮助我在这个方向(东斯已经做了)。 :)

tl;dr: I understand that my Repa code above may be stupid or pathological, but it's the best I can do right now. I would love to be able to write better Haskell code, and I hope that you can help me in that direction (dons already did). :)

function archimedes_simple()

particles = [0 0 0]';
a = 4;

step = 0.1;

xrange = [-10:step:10];
yrange = [-10:step:10];
zrange = [-10:step:10];

[X,Y,Z] = ndgrid(xrange,yrange,zrange);
dists2 = bsxfun(@minus,X,particles(1)).^2+ ...
    bsxfun(@minus,Y,particles(2)).^2+ ...
    bsxfun(@minus,Z,particles(3)).^2;
inside = dists2 < a^2;
num_inside = sum(inside(:));

disp('');
disp(['Step = ' num2str(step)]);
disp(['Volume of individual particles = ' num2str(size(particles,2)*4*pi*a^3/3)]);
disp(['Volume of cubes inside particles = ' num2str(step^3*num_inside)]);

end

编辑2 :新的,更快和Repa code的简化版本。

Edit 2: New, faster and simpler version of the Repa code

我现在已经读了多一点的Repa,以为了一下。下面是一个新Repa版本。在这种情况下,创建在x,y和z坐标作为三维阵列,使用Repa延伸功能,从值的列表(类似于如何ndgrid作品在Matlab)。我然后映射在这些阵列以计算到球形颗粒的距离。最后,我折叠所得的3-D距离阵列,计数多少的坐标的范围内,然后由一个常数因子乘以它得到的近似体积。我的算法的执行现在更加类似于上面的Matlab的版本,并且不再有任何转换到矢量

I have now read up a bit more on Repa, and thought a bit. Below is a new Repa version. In this case, I create the x, y, and z coordinates as 3-D arrays, using the Repa extend function, from a list of values (similar to how ndgrid works in Matlab). I then map over these arrays to calculate the distance to the spherical particle. Finally, I fold over the resulting 3-D distance array, count how many coordinates are within the sphere, and then multiply it by a constant factor to get the approximate volume. My implementation of the algorithm is now much more similar to the Matlab version above, and there are no longer any conversion to vector.

新版本在大约5秒钟运行我的电脑,从上面有了很大的改进上。的时间是相同的,如果我使用线程在编译时,用+ RTS -N2组合与否,但线程化的版本做最大程度的发挥我的电脑的双核。我没有,但是,看到-N2跑3.1秒几滴,但后来无法重现他们。也许这是在同一时间运行的其他进程很敏感?基准时,我已经关闭我的电脑上大多数程序,但仍有一些程序运行,比如后台进程。

The new version runs in about 5 seconds on my computer, a considerable improvement from above. The timing is the same if I use "threaded" while compiling, combined with "+RTS -N2" or not, but the threaded version does max out both cores of my computer. I did, however, see a few drops of the "-N2" run to 3.1 seconds, but couldn't reproduce them later. Maybe it is very sensitive to other processes running at the same time? I have shut most programs on my computer when benchmarking, but there are still some programs running, such as background processes.

如果我们用-N2,并添加运行时开关关闭并行GC(-qg),时间持续下降到〜4.1秒,并使用-qa为使用OS设置线程的亲和力(实验)时,时间被调低至〜3.5级秒。纵观输出带+ RTS -s运行程序,使用-qg进行更GC。

If we use "-N2" and add the runtime switch to turn off parallel GC (-qg), the time consistently goes down to ~4.1 seconds, and using -qa to "use the OS to set thread affinity (experimental)", the time was shaved down to ~3.5 seconds. Looking at the output from running the program with "+RTS -s", much less GC is performed using -qg.

今天下午,我将看看我是否能有8核电脑上运行code,只是为了好玩。 :)

This afternoon I will see if I can run the code on an 8-core computer, just for fun. :)

import Data.Array.Repa as R
import Prelude as P
import qualified Data.List as L

-- Calculate the volume of a spherical particle by putting it in a bath of coordinates.     Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is     inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to     find an approximate volume.

particles :: [(Double,Double,Double)]
particles = [(0,0,0)]

-- Radius of the spherical particle
a = 4

volume_individuals = fromIntegral (length particles) * 4*pi*a^3/3

-- Generate the coordinates. 
step = 0.1
coords_list = [-10,-10+step..10] :: [Double]
num_coords = (length coords_list) :: Int

coords :: Array DIM1 Double
coords = fromList (Z :. (num_coords ::Int)) coords_list

coords_slice :: Array DIM1 Double
coords_slice = slice coords (Z :. All)

-- x, y and z are 3-D arrays, where the same index into each array can be used to find a     single coordinate, e.g. (x(i,j,k),y(i,j,k),z(i,j,k)).
x,y,z :: Array DIM3 Double
x = extend (Z :. All :. num_coords :. num_coords) coords_slice
y = extend (Z :. num_coords :. All :. num_coords) coords_slice
z = extend (Z :. num_coords :. num_coords :. All) coords_slice

-- Calculate the squared distance from each coordinate to the center of the spherical     particle.
dist2 :: (Double, Double, Double) -> Array DIM3 Double
dist2 particle = ((R.map (squared_diff xp) x) + (R.map (squared_diff yp) y) + (R.map (    squared_diff zp) z)) 
    where
        (xp,yp,zp) = particle
        squared_diff xi xa = (xa-xi)^2

-- Count how many of the coordinates are within the spherical particle.
num_inside_particle :: (Double,Double,Double) -> Double
num_inside_particle particle = foldAll (\acc x -> if x<a^2 then acc+1 else acc) 0 (force     $ dist2 particle)

-- Calculate the approximate volume covered by the spherical particle.
volume_inside :: [Double]
volume_inside = P.map ((*step^3) . num_inside_particle) particles

main = do
    putStrLn $ "Step = " P.++ show step
    putStrLn $ "Volume of individual particles = " P.++ show volume_individuals
    putStrLn $ "Volume of cubes inside each particle (repa) = " P.++ (P.concat . (    L.intersperse ", ") . P.map show) volume_inside

-- As an alternative, y and z could be generated from x, but this was slightly slower in     my tests (~0.4 s).
--y = permute_dims_3D x
--z = permute_dims_3D y

-- Permute the dimensions in a 3-D array, (forward, cyclically)
permute_dims_3D a = backpermute (swap e) swap a
    where
        e = extent a
        swap (Z :. i:. j :. k) = Z :. k :. i :. j

空间分析新code

同类型的配置文件作为唐·斯图尔特的下面进行,但对于新Repa code。

The same types of profiles as Don Stewart made below, but for the new Repa code.



推荐答案

code审批说明


    你的时间
  • 47.8%都花在GC。

  • 1.5G被分配在堆上(!)

  • 的repa code看起来的很多的比列表code更复杂。

  • 并行GC的很多是发生

  • 我可以得到高达300效率%A -N4计算机

  • 在更多类型签名把将使它更容易分析...

  • RSIZE 未使用(看起来贵!)

  • 您转换repa阵列载体,为什么呢?

  • 所有的用途(**)可以通过更便宜的(^)在<$ C取代$ C>内部。

  • 有大,中,不断列出可疑号码。这些都被转换为阵列 - 这似乎昂贵

  • 任何(==真)相同

  • 47.8% of your time is spent in GC.
  • 1.5G is allocated on the heap (!)
  • The repa code looks a lot more complicated than the list code.
  • Lots of parallel GC is occuring
  • I can get up to 300% efficiency on a -N4 machine
  • Putting in more type signatures will make it easier to analyze...
  • rsize isn't used (looks expensive!)
  • You convert repa arrays to vectors, why?
  • All your uses of (**) could be replaced by the cheaper (^) on Int.
  • There's a suspicious number of large, constant lists. Those all have to be converted to arrays -- that seems expensive.
  • any (==True) is the same as or

时间分析

COST CENTRE                    MODULE               %time %alloc

squared_diff                   Main                  25.0   27.3
insideParticle                 Main                  13.8   15.3
sum_squared_diff               Main                   9.8    5.6
rcoords                        Main                   7.4    5.6
particle_extended              Main                   6.8    9.0
particle_slice                 Main                   5.0    7.6
insideParticles                Main                   5.0    4.4
yslice                         Main                   3.6    3.0
xslice                         Main                   3.0    3.0
ssd_vec                        Main                   2.8    2.1
**^                            Main                   2.6    1.4

显示,你的函数 squared_diff 是有点可疑:

squared_diff :: Array DIM2 Double
squared_diff = deepSeqArrays [rcoords,particle_extended]
                    ((force2 rcoords) -^ (force2 particle_extended)) **^ 2    

虽然我没有看到任何明显的修复。

though I don't see any obvious fix.

空格分析

没有在空间配置文件太不可思议了:你清楚地看到列表阶段,那么矢量相位。该列表阶段分配了很多,这被回收。

Nothing too amazing in the space profile: you clearly see the list phase, then the vector phase. The list phase allocates a lot, which gets reclaimed.

按类型打破堆,我们看到最初很多列表和元组被分配(需求),那么阵列的一大块被分配及召开方式:

Breaking down the heap by type, we see initially a lot of lists and tuples being allocated (on demand), then a big chunk of arrays are allocated and held:

再次还挺我们期望看到...数组的东西是不分配尤其是比列表code以上(其实有点不太顺​​位),但它仅仅是采取了很多时间来运行

Again, kinda what we expected to see... the array stuff isn't allocating especially more than the list code (in fact, a bit less overall), but it is just taking a lot longer to run.

检查内存泄露与固定分析的:

还有一些有趣的事情,但没有什么惊人的。 zcoords 被保留列表程序执行的长度,那么有些阵列(SYSTEM)被分配给repa运行。

There's a few interesting things there, but nothing startling. zcoords gets retained for the length of the list program execution, then some arrays (SYSTEM) are being allocated for the repa run.

检查在核心

所以在这一点上,我先假设你真的实现列表和数组相同的算法(即没有额外的工作是在阵列的情况下正在做的),而且没有明显的空间泄漏。所以我怀疑是严重优化repa code。让我们来看看核心(与 GHC核心

So at this point I'm firstly assuming that you really did implement the same algorithms in lists and arrays (i.e. no extra work is being done in the array case), and there's no obvious space leak. So my suspicion is badly-optimized repa code. Let's look at the core (with ghc-core.


  • 基于列表,code看起来不错。

  • 数组code看起来是合理的(即拆箱元出现),但非常复杂的,而且它的很多。

内联所有的CAF

我添加了内联编译指示所有顶级数组的定义,在希望删除一些CAFS,并得到GHC优化阵列code有点困难。这着实让GHC斗争编译模块(分配到4.3G,10分钟就可以了工作时)。这是一个线索,我认为GHC无法很好之前优化这个节目,因为有新的东西为它时,我增加了门槛做的。

I added inline pragmas to all the top level array definitions, in a hope to remove some of the CAfs, and get GHC to optimize the array code a bit harder. This really made GHC struggle to compile the module (allocating up to 4.3G and 10 minutes while working on it). This is a clue to me that GHC wasn't able to optimize this program well before, since there's new stuff for it to do when I increase the thresholds.

操作


  • 使用-H可以降低GC所花费的时间。

  • 尝试以消除列出了转换到REPAS为载体。

  • 所有这些的CAF(顶级恒定的数据结构)是有点奇怪 - 一个真正的程序不会是顶级的常量列表 - 事实上,这个模块是病理那么,造成大量的价值被保留在很长时间,而不是,被优化掉。浮法局部定义向内。

  • 要求从本Lippmeier ,Repa的作者,帮助特别是因为有一些时髦的东西优化发生的事情。

  • Using -H can decrease the time spent in GC.
  • Try to eliminate the conversions from lists to repas to vectors.
  • All those CAFs (top level constant data structures) are kinda weird -- a real program wouldn't be a list of top level constants -- in fact, this module is pathologically so, causing lots of values to be retained over long periods, instead of being optimized away. Float local definitions inwards.
  • Ask for help from Ben Lippmeier, the author of Repa, particularly since there's some funky optimization stuff happening.

这篇关于的&QUOT最佳途径;遍历一个二维数组&QUOT ;,使用Repa的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆