Reasonably efficient pure-functional matrix product in Haskell?

Question


I know Haskell a little bit, and I wonder if it's possible to write something like a matrix-matrix product in Haskell that is all of the following:

  1. Pure-functional: no IO or State monads in its type signature (I don't care what happens in the function body. That is, I don't care if the function body uses monads, as long as the whole function is pure). I may want to use this matrix-matrix product in a pure function.
  2. Memory-safe: no malloc or pointers. I know that it's possible to "write C" in Haskell, but you lose memory safety. Actually writing this code in C and interfacing it with Haskell also loses memory safety.
  3. As efficient as, say, Java. For concreteness, let's assume I'm talking about a simple triple loop, single precision, contiguous column-major layout (float[], not float[][]) and matrices of size 1000x1000, and a single-core CPU. (If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark.)

(I don't want this to sound like a challenge, but note that Java can satisfy all of the above easily.)

I already know that

  1. The triple loop implementation is not the most efficient one. It's quite cache-unfriendly. It's better to use a well-written BLAS implementation in this particular case. However, one cannot always count on a C library being available for what one is trying to do. I wonder if reasonably efficient code can be written in normal Haskell.
  2. Some people wrote whole research papers that demonstrate #3. However, I'm not a computer science researcher. I wonder if it's possible to keep simple things simple in Haskell.
  3. The Gentle Introduction to Haskell has a matrix product implementation (a sketch in that style follows below). It wouldn't satisfy the above requirements though.
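
For illustration, here is a list-of-lists product in the spirit of the Gentle Introduction (a minimal sketch, not the book's exact code, which uses a two-dimensional Array and list comprehensions). It is pure and concise, but boxed, lazy lists give no contiguous storage, so it misses requirement 3 by a wide margin:

module ListMatProd where

import Data.List (transpose)

-- Pure and short, but every element is a boxed, lazily evaluated
-- Float, so this allocates heavily and has poor cache behaviour.
matMultLists :: [[Float]] -> [[Float]] -> [[Float]]
matMultLists a b = [[sum (zipWith (*) row col) | col <- bt] | row <- a]
  where bt = transpose b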

Addressing comments:

I have three reasons: first, the "no malloc or pointers" requirement is as yet ill-defined (I challenge you to write any piece of Haskell code which uses no pointers);

I saw plenty of Haskell programs not using Ptr. Perhaps it refers to the fact that at the machine instruction level, pointers will be used? That's not what I meant. I was referring to the abstraction level of the Haskell source code.

second, the attack on CS research is out of place (and furthermore I can't imagine anything simpler than using code somebody else has already written for you); third, there are many matrix packages on Hackage (and the prep work for asking this question should include reviewing and rejecting each).

It seems that your #2 and #3 are the same ("use existing libraries"). I'm interested in the matrix product as a simple test of what Haskell can do on its own, and whether it allows you to "keep simple things simple". I could have easily come up with a numerical problem that doesn't have any ready libraries, but then I'd have to explain the problem, whereas everyone already knows what a matrix product is.

How can Java possibly satisfy 1.? Any Java method is essentially :: IORef Arg -> ... -> IORef This -> IO Ret

This goes to the root of my question, actually (+1). While Java does not claim to track purity, Haskell does. In Java, whether the function is pure or not is indicated in the comments. I can claim that the matrix product is pure, even though I do mutation in the function body. The question is whether Haskell's approach (purity encoded in the type system) is compatible with efficiency, memory-safety and simplicity.
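
For reference, the mechanism Haskell offers for exactly this is the ST monad: a function may mutate freely in its body yet expose a pure type, because runST's rank-2 type guarantees the mutable state cannot escape. A minimal sketch (a hypothetical example, not taken from the question):

module SumST where

import Control.Monad.ST
import Data.STRef

-- Mutation inside, pure signature outside: no STRef can leak out
-- of runST, so sumTo is observably pure despite the in-place update.
sumTo :: Int -> Int
sumTo n = runST $ do
    acc <- newSTRef 0
    mapM_ (\i -> modifySTRef' acc (+ i)) [1 .. n]
    readSTRef acc

This is the same device the answer below uses, via runSTUArray.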

Solution

As efficient as, say, Java. For concreteness, let's assume I'm talking about a simple triple loop, single precision, contiguous column-major layout (float[], not float[][]) and matrices of size 1000x1000, and a single-core CPU. (If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark)

So something like

public class MatrixProd {
    // Plain triple loop over flat, row-major float arrays.
    static float[] matProd(float[] a, int ra, int ca, float[] b, int rb, int cb) {
        if (ca != rb) {
            throw new IllegalArgumentException("Matrices not fitting");
        }
        float[] c = new float[ra*cb];
        for(int i = 0; i < ra; ++i) {
            for(int j = 0; j < cb; ++j) {
                float sum = 0;
                for(int k = 0; k < ca; ++k) {
                    sum += a[i*ca+k]*b[k*cb+j];
                }
                c[i*cb+j] = sum;
            }
        }
        return c;
    }

    // Fills an rs-by-cs matrix with x, x+d, x+2*d, ... in row order.
    static float[] mkMat(int rs, int cs, float x, float d) {
        float[] arr = new float[rs*cs];
        for(int i = 0; i < rs; ++i) {
            for(int j = 0; j < cs; ++j) {
                arr[i*cs+j] = x;
                x += d;
            }
        }
        return arr;
    }

    public static void main(String[] args) {
        int sz = 100;
        float strt = -32, del = 0.0625f;
        if (args.length > 0) {
            sz = Integer.parseInt(args[0]);
        }
        if (args.length > 1) {
            strt = Float.parseFloat(args[1]);
        }
        if (args.length > 2) {
            del = Float.parseFloat(args[2]);
        }
        float[] a = mkMat(sz,sz,strt,del);
        float[] b = mkMat(sz,sz,strt-16,del);
        System.out.println(a[sz*sz-1]);
        System.out.println(b[sz*sz-1]);
        long t0 = System.currentTimeMillis();
        float[] c = matProd(a,sz,sz,b,sz,sz);
        System.out.println(c[sz*sz-1]);
        long t1 = System.currentTimeMillis();
        double dur = (t1-t0)*1e-3;
        System.out.println(dur);
    }
}

I suppose? (I hadn't read the specs properly before coding, so the layout is row-major, but since the access pattern is the same, that doesn't make a difference as mixing layouts would, so I'll assume that's okay.)
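
To make the layout remark concrete: for a flat array, the two layouts differ only in the indexing formula, so a triple loop looks the same either way. Illustrative helpers (not used by the code above):

-- Row-major: elements of a row are adjacent in memory.
-- Column-major: elements of a column are adjacent.
idxRowMajor, idxColMajor :: Int -> Int -> Int -> Int
idxRowMajor cols i j = i*cols + j
idxColMajor rows i j = j*rows + i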

I haven't spent any time on thinking about a clever algorithm or low-level optimisation tricks (I wouldn't achieve much in Java with those anyway). I just wrote the simple loop, because

I don't want this to sound like a challenge, but note that Java can satisfy all of the above easily

And that's what Java gives easily, so I'll take that.

(If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark)

Nowhere near, I'm afraid, neither in Java nor in Haskell. Too many cache misses to reach that throughput with the simple triple loop. (For scale: at sz = 1000 the product performs about 2·10⁹ floating-point operations, so the roughly 8-second timings below come out near 0.25 GFLOP/s, on the order of 0.1 operations per cycle for a typical few-GHz core.)

Doing the same in Haskell, again no thinking about being clever, a plain straightforward triple loop:

{-# LANGUAGE BangPatterns #-}
module MatProd where

import Data.Array.ST
import Data.Array.Unboxed

matProd :: UArray Int Float -> Int -> Int -> UArray Int Float -> Int -> Int -> UArray Int Float
matProd a ra ca b rb cb =
    let (al,ah)     = bounds a
        (bl,bh)     = bounds b
        {-# INLINE getA #-}
        getA i j    = a!(i*ca + j)          -- element (i,j) of row-major a
        {-# INLINE getB #-}
        getB i j    = b!(i*cb + j)          -- element (i,j) of row-major b
        {-# INLINE idx #-}
        idx i j     = i*cb + j              -- flat index into the result
    in if al /= 0 || ah+1 /= ra*ca || bl /= 0 || bh+1 /= rb*cb || ca /= rb
         then error $ "Matrices not fitting: " ++ show (ra,ca,al,ah,rb,cb,bl,bh)
         else runSTUArray $ do
            arr <- newArray (0,ra*cb-1) 0
            -- outer and inner are tail-recursive stand-ins for the three
            -- for-loops; inner accumulates the dot product in the strict y.
            let outer i j
                    | ra <= i   = return arr
                    | cb <= j   = outer (i+1) 0
                    | otherwise = do
                        !x <- inner i j 0 0
                        writeArray arr (idx i j) x
                        outer i (j+1)
                inner i j k !y
                    | ca <= k   = return y
                    | otherwise = inner i j (k+1) (y + getA i k * getB k j)
            outer 0 0

-- Same fill pattern as the Java mkMat: x, x+d, x+2*d, ... in row order.
mkMat :: Int -> Int -> Float -> Float -> UArray Int Float
mkMat rs cs x d = runSTUArray $ do
    let !r = rs - 1
        !c = cs - 1
        {-# INLINE idx #-}
        idx i j = cs*i + j
    arr <- newArray (0,rs*cs-1) 0
    let outer i j y
            | r < i     = return arr
            | c < j     = outer (i+1) 0 y
            | otherwise = do
                writeArray arr (idx i j) y
                outer i (j+1) (y + d)
    outer 0 0 x

and the calling module

module Main (main) where

import System.Environment (getArgs)
import Data.Array.Unboxed

import System.CPUTime
import Text.Printf

import MatProd

main :: IO ()
main = do
    args <- getArgs
    let (sz, strt, del) = case args of
                            (a:b:c:_) -> (read a, read b, read c)
                            (a:b:_)   -> (read a, read b, 0.0625)
                            (a:_)     -> (read a, -32, 0.0625)
                            _         -> (100, -32, 0.0625)
        a = mkMat sz sz strt del
        b = mkMat sz sz (strt - 16) del
    print (a!(sz*sz-1))
    print (b!(sz*sz-1))
    t0 <- getCPUTime
    let c = matProd a sz sz b sz sz
    print $ c!(sz*sz-1)
    t1 <- getCPUTime
    printf "%.6f\n" (fromInteger (t1-t0)*1e-12 :: Double)

So we're doing almost exactly the same things in both languages. Compile the Haskell with -O2, the Java with javac
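
(The exact invocations aren't shown in the answer; presumably something like the following, with the output name vmmult matching the run below:)

$ ghc -O2 -o vmmult Main.hs
$ javac MatrixProd.java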

$ java MatrixProd 1000 "-13.7" 0.013
12915.623
12899.999
8.3592897E10
8.193
$ ./vmmult 1000 "-13.7" 0.013
12915.623
12899.999
8.35929e10
8.558699

And the resulting times are quite close.

And if we compile the Java code to native, with gcj -O3 -Wall -Wextra --main=MatrixProd -fno-bounds-check -fno-store-check -o jmatProd MatrixProd.java,

$ ./jmatProd 1000 "-13.7" 0.013
12915.623
12899.999
8.3592896512E10
8.215

there's still no big difference.

As a special bonus, the same algorithm in C (gcc -O3):

$ ./cmatProd 1000 "-13.7" 0.013
12915.623047
12899.999023
8.35929e+10
8.079759

So this reveals no fundamental difference between straightforward Java and straightforward Haskell when it comes to computationally intensive tasks using floating point numbers (when dealing with integer arithmetic on medium to large numbers, the use of GMP by GHC makes Haskell outperform Java's BigInteger by a huge margin for many tasks, but that is of course a library issue, not a language one), and both are close to C with this algorithm.

In all fairness, though, that is because the access pattern causes a cache-miss every other nanosecond, so in all three languages this computation is memory-bound.

If we improve the access pattern by multiplying a row-major matrix with a column-major matrix, all become faster: the gcc-compiled C finishes in 1.18s, Java takes 1.23s, and the GHC-compiled Haskell takes around 5.8s, which can be reduced to 3 seconds by using the LLVM backend.
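
What that change amounts to in the Haskell version: only the indexing of b changes, after which the innermost loop scans both operands sequentially. A hedged sketch (assuming b is stored column-major with rb rows, so column j is contiguous; this is not the answer's exact code):

{-# LANGUAGE BangPatterns #-}
module DotRC where

import Data.Array.Unboxed

-- Row i of row-major a times column j of column-major b: both are
-- contiguous scans, which is what removes most of the cache misses.
dotRC :: UArray Int Float -> UArray Int Float -> Int -> Int -> Int -> Int -> Float
dotRC a b ca rb i j = go 0 0
  where
    go k !y
      | ca <= k   = y
      | otherwise = go (k+1) (y + a!(i*ca + k) * b!(j*rb + k))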

Here, the range-check by the array library really hurts. Using the unchecked array access (as one should, after checking for bugs, since the checks are already done in the code controlling the loops), GHC's native backend finishes in 2.4s, going via the llvm backend lets the computation finish in 1.55s, which is decent, although significantly slower than both C and Java. Using the primitives from GHC.Prim instead of the array library, the llvm backend produces code that runs in 1.16s (again, without bounds-checking on each access, but that only valid indices are produced during the computation can in this case easily be proved before, so here, no memory-safety is sacrificed¹; checking each access brings the time up to 1.96s, still significantly better than the bounds checking of the array library).
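
For the curious, "unchecked array access" with the array library means something like unsafeAt from Data.Array.Base, which skips the per-element bounds check. A sketch under the stated precondition that indices were validated beforehand (again, not the answer's exact code):

{-# LANGUAGE BangPatterns #-}
module DotUnchecked where

import Data.Array.Base (unsafeAt)
import Data.Array.Unboxed

-- Safe only because the caller has already checked that both arrays
-- hold at least n elements; unsafeAt performs no check of its own.
dotUnchecked :: UArray Int Float -> UArray Int Float -> Int -> Float
dotUnchecked a b n = go 0 0
  where
    go k !y
      | n <= k    = y
      | otherwise = go (k+1) (y + (a `unsafeAt` k) * (b `unsafeAt` k))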

Bottom line: GHC needs (much) faster branching for the bounds-checking, and there's room for improvement in the optimiser, but in principle, "Haskell's approach (purity encoded in the type system) is compatible with efficiency, memory-safety and simplicity", we're just not yet there. For the time being, one has to decide how much of which point one is willing to sacrifice.


¹ Yes, that's a special case, in general omitting the bounds-check does sacrifice memory-safety, or it is at least harder to prove that it doesn't.
