Haskell在乘加运算上的数学表现 [英] Haskell math performance on multiply-add operation

查看:154
本文介绍了Haskell在乘加运算上的数学表现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在Haskell编写一个游戏,我在UI中的当前传递涉及很多程序生成的几何。我目前专注于确定一个特定操作(C-ish伪代码)的性能:

  Vec4f乘数,加数; 
Vec4f vecList [];
for(int i = 0; i< count; i ++)
vecList [i] = vecList [i] * multiplier + addend;

也就是说,四个浮点数的标准乘法加法,这是SIMD优化。

结果会传送到OpenGL顶点缓冲区,因此最终必须将其转储到平面C数组中。出于同样的原因,计算应该可以在C'float'类型上完成。

我已经寻找了一个库或一个原生的惯用解决方案来完成这种排序在Haskell中很快,但是我提出的每个解决方案似乎都在性能的2%左右(即慢50倍),而GCC的C则是正确的。诚然,我几个星期前从Haskell开始,所以我的经验是有限的 - 这就是为什么我要来找你们。你们中的任何一个人可以为更快的Haskell实现提供建议,或者提供有关如何编写高性能Haskell代码的文档指南?首先,最新的Haskell解决方案(时钟约12秒)。我尝试了这个SO贴子中的爆炸模式,但它没有使差异AFAICT。用'(\iv-> v * 4)替换'multAdd'使执行时间缩短到1.9秒,所以逐位填充(以及随之而来的自动优化挑战)似乎没有太多错误。

  { - #LANGUAGE BangPatterns# - } 
{ - #OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess -precision -optc-march = native# - }

导入Data.Vector.Storable
将限定的Data.Vector.Storable导入为V
导入Foreign.C.Types
import Data.Bits

repCount = 10000
arraySize = 20000

a = fromList $ [0.2 :: CFloat,0.1,0.6,1.0]
m = fromList $ [0.99 :: CFloat,0.7,0.8,0.6]

multAdd :: Int - > CFloat - > CFloat
multAdd!i!v = v *(m!(i。&。3))+(a!(i。&。3))

multList :: Int - > Vector CFloat - > Vector CFloat
multList!count!src
|计数<= 0 = src
|否则= multList(count-1)$ V.imap multAdd src
$ b $ main = do
print $ Data.Vector.Storable.sum $ multList repCount $
Data.Vector。 Storable.replicate(arraySize * 4)(0 :: CFloat)

下面是我在C中的内容。这里的代码有几个#ifdefs,它可以防止它被直接编译;向下滚动测试驱动程序。


#include< stdlib.h>
#include< time.h>

typedef float v4fs __attribute__((vector_size(16)));
typedef struct {float x,y,z,w; } Vector4;

void setv4(v4fs * v,float x,float y,float z,float w){
float * a =(float *)v;
a [0] = x;
a [1] = y;
a [2] = z;
a [3] = w;
}

float sumv4(v4fs * v){
float * a =(float *)v;
返回a [0] + a [1] + a [2] + a [3];
}

void vecmult(v4fs * MAYBE_RESTRICT s,v4fs * MAYBE_RESTRICT d,v4fs a,v4fs m){
for(int j = 0; j d [j] = s [j] * m + a;


$ b无效scamult(float * MAYBE_RESTRICT s,float * MAYBE_RESTRICT d,
Vector4 a,Vector4 m){
for(int j = 0; j <(N * 4); j + = 4){
d [j + 0] = s [j + 0] * mx + ax;
d [j + 1] = s [j + 1] * m.y + a.y;
d [j + 2] = s [j + 2] * m.z + a.z;
d [j + 3] = s [j + 3] * m.w + a.w;
}
}

int main(){
v4fs a,m;
v4fs * s,* d;

setv4(& a,0.2,0.1,0.6,1.0);
setv4(& m,0.99,0.7,0.8,0.6);

s = calloc(N,sizeof(v4fs));
d = s;

double start = clock();
for(int i = 0; i
#ifdef COPY
d = malloc(N * sizeof(v4fs));
#endif

#ifdef VECTOR
vecmult(s,d,a,m);
#else
Vector4 aa = *(Vector4 *)(& a);
Vector4 mm = *(Vector4 *)(& m);
scamult((float *)s,(float *)d,aa,mm);
#endif

#ifdef COPY
free(s);
s = d;
#endif
}
double end = clock();

float sum = 0;
for(int j = 0; j sum + = sumv4(s + j);
}
printf(% - 50s%2.5f%f\\\
\\\
,NAME,
(end - start)/(double)CLOCKS_PER_SEC,sum);
}

该脚本将编译并运行多个gcc标志组合的测试。在我的系统上,cmath-64-native-O3-restrict-vector-nocopy的性能最佳,耗时0.22秒。

  import System.Process 
import GHC.IOBase

cBase =(cmath,gcc mult.c -ggdb --std = c99 -DM = 10000 -DN = 20000)
cOptions = [
[(32,-m32),(64,-m64)],
[(generic,),( native,-march = native -msse4)],
[(O1,-O1),(O2,-O2),(O3,-O3 )],
[(restrict,-DMAYBE_RESTRICT = __ restrict__),
(norestrict,-DMAYBE_RESTRICT =)],
[(vector, -DVECTOR),(scalar,)],
[(copy,-DCOPY),(nocopy,)]
]

- 折叠双列表的笛卡尔乘积。可能是一个前奏功能
- 或者两个这样做,但嘿。 '烫发'提到排列,直到我意识到
- 这实际上并没有做排列。 '
permfold ::(a - > a - > a) - > a - > [[a]] - > [a]
permfold fz [] = [z]
permfold fz(x:xs)= concat $ map(\a- - >(permfold f(fza)xs))x

prepCmd ::(String,String) - > (String,String) - > (String,String)
prepCmd(name,cmd)(namea,cmda)=
(name ++ - ++ namea,cmd ++++ cmda)

runCCmd name compileCmd = do
res< - system(compileCmd ++-DNAME = \\\\++ name ++\\\-o+ + name)
if res == ExitSuccess
然后执行system(./++ name)
return()
else putStrLn $ name ++did not compile

main = do
mapM_(uncurry runCCmd)$ permfold prepCmd cBase cOptions




实际上,核心看起来大致可以确定为
我。使用unsafeIndex而不是(!)
使程序以
快两倍以上(请参阅上面的答案)。下面的
程序要快得多,但是
(和更干净的IMO)。我怀疑C计划与b $ b之间的剩余
差额是由于GHC的浮动
点的一般
吮吸。 HEAD通过NCG和产生
的最佳结果首先,定义一个新的Vec4数据类型:




  { - #LANGUAGE BangPatterns# - } 

导入Data.Vector.Storable
将限定的Data.Vector.Storable导入为V
import Foreign
import Foreign.C.Types

- 定义一个4元素的向量类型
data Vec4 = Vec4 { - #UNPACK# - }!CFloat
{ - #UNPACK# - }!CFloat
{ - #UNPACK# - }!CFloat
{ - #UNPACK# - }!CFloat

确保我们可以将它存储在一个数组中

  instance Storable Vec4 where 
sizeOf _ = sizeOf(undefined :: CFloat)* 4
alignment _ = alignment(undefined :: CFloat)

{ - #INLINE peek# - }
peek p = do
a< - peekElemOff q 0
b< - peekElemOff q 1
c< - peekElemOff q 2
d< - peekElemOff q 3
return(Vec4 abcd)
其中
q = castPtr p
{ - #INLINE poke# - }
poke p(Vec4 abcd)= do
pokeElemOff q 0 a
pokeElemOff q 1 b
pokeElemOff q 2 c
pokeElemOff q 3 d
其中
q = castPtr p

类型:

  a = Vec4 0.2 0.1 0.6 1.0 
m = Vec4 0.99 0.7 0.8 0.6

add :: Vec4 - > Vec4 - > Vec4
{ - #INLINE add# - }
add(Vec4 abcd)(Vec4 a'b'c'd')= Vec4(a + a')(b + b')(c + c')(d + d')

mult :: Vec4 - > Vec4 - > Vec4
{ - #INLINE mult# - }
mult(Vec4 abcd)(Vec4 a'b'c'd')= Vec4(a * a')(b * b')(c * c')(d * d')

vsum :: Vec4 - > CFloat
{ - #INLINE vsum# - }
vsum(Vec4 a b c d)= a + b + c + d

multList :: Int - > Vector Vec4 - > Vector Vec4
multList!count!src
|计数<= 0 = src
|否则= multList(count-1)$ V.map(\v-> add(mult vm)a)src

main = do
print $ Data.Vector.Storable。 sum
$ Data.Vector.Storable.map vsum
$ multList repCount
$ Data.Vector.Storable.replicate arraySize(Vec4 0 0 0 0)

repCount ,arraySize :: Int
repCount = 10000
arraySize = 20000

使用ghc 6.12.1,-O2 -fasm:


<1.75>


使用ghc HEAD(6月26日),-O2 -fasm -msse2



这看起来是编写Vec4阵列最习惯的方式,并且性能最佳(比原来快11倍)。 (这可能会成为GHC的LLVM后端的基准)

I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one particular operation (C-ish pseudocode):

Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
    vecList[i] = vecList[i] * multiplier + addend;

That is, a bog-standard multiply-add of four floats, the kind of thing ripe for SIMD optimization.

The result is going to an OpenGL vertex buffer, so it has to get dumped into a flat C array eventually. For the same reason, the calculations should probably be done on C 'float' types.

I've looked for either a library or a native idiomatic solution to do this sort of thing quickly in Haskell, but every solution I've come up with seems to hover around 2% of the performance (that is, 50x slower) compared to C from GCC with the right flags. Granted, I started with Haskell a couple weeks ago, so my experience is limited—which is why I'm coming to you guys. Can any of you offer suggestions for a faster Haskell implementation, or pointers to documentation on how to write high-performance Haskell code?

First, the most recent Haskell solution (clocks about 12 seconds). I tried the bang-patterns stuff from this SO post, but it didn't make a difference AFAICT. Replacing 'multAdd' with '(\i v -> v * 4)' brought execution time down to 1.9 seconds, so the bitwise stuff (and consequent challenges to automatic optimization) doesn't seem to be too much at fault.

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign.C.Types
import Data.Bits

repCount = 10000
arraySize = 20000

a = fromList $ [0.2::CFloat,  0.1, 0.6, 1.0]
m = fromList $ [0.99::CFloat, 0.7, 0.8, 0.6]

multAdd :: Int -> CFloat -> CFloat
multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3))

multList :: Int -> Vector CFloat -> Vector CFloat
multList !count !src
    | count <= 0    = src
    | otherwise     = multList (count-1) $ V.imap multAdd src

main = do
    print $ Data.Vector.Storable.sum $ multList repCount $ 
        Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)

Here's what I have in C. The code here has a few #ifdefs which prevents it from being compiled straight-up; scroll down for the test driver.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef float v4fs __attribute__ ((vector_size (16)));
typedef struct { float x, y, z, w; } Vector4;

void setv4(v4fs *v, float x, float y, float z, float w) {
    float *a = (float*) v;
    a[0] = x;
    a[1] = y;
    a[2] = z;
    a[3] = w;
}

float sumv4(v4fs *v) {
    float *a = (float*) v;
    return a[0] + a[1] + a[2] + a[3];
}

void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) {
    for (int j = 0; j < N; j++) {
        d[j] = s[j] * m + a;
    }
}

void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d,
             Vector4 a, Vector4 m) {
    for (int j = 0; j < (N*4); j+=4) {
        d[j+0] = s[j+0] * m.x + a.x;
        d[j+1] = s[j+1] * m.y + a.y;
        d[j+2] = s[j+2] * m.z + a.z;
        d[j+3] = s[j+3] * m.w + a.w;
    }
}

int main () {
    v4fs a, m;
    v4fs *s, *d;

    setv4(&a, 0.2, 0.1, 0.6, 1.0);
    setv4(&m, 0.99, 0.7, 0.8, 0.6);

    s = calloc(N, sizeof(v4fs));
    d = s;

    double start = clock();
    for (int i = 0; i < M; i++) {

#ifdef COPY
        d = malloc(N * sizeof(v4fs));
#endif

#ifdef VECTOR
        vecmult(s, d, a, m);
#else
        Vector4 aa = *(Vector4*)(&a);
        Vector4 mm = *(Vector4*)(&m);
        scamult((float*)s, (float*)d, aa, mm);
#endif

#ifdef COPY
        free(s);
        s = d;
#endif
    }
    double end = clock();

    float sum = 0;
    for (int j = 0; j < N; j++) {
        sum += sumv4(s+j);
    }
    printf("%-50s %2.5f %f\n\n", NAME,
            (end - start) / (double) CLOCKS_PER_SEC, sum);
}

This script will compile and run the tests with a number of gcc flag combinations. The best performance was had by cmath-64-native-O3-restrict-vector-nocopy on my system, taking 0.22 seconds.

import System.Process
import GHC.IOBase

cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000")
cOptions = [
            [("32", "-m32"), ("64", "-m64")],
            [("generic", ""), ("native", "-march=native -msse4")],
            [("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")],
            [("restrict", "-DMAYBE_RESTRICT=__restrict__"),
                ("norestrict", "-DMAYBE_RESTRICT=")],
            [("vector", "-DVECTOR"), ("scalar", "")],
            [("copy", "-DCOPY"), ("nocopy", "")]
           ]

-- Fold over the Cartesian product of the double list. Probably a Prelude function
-- or two that does this, but hey. The 'perm' referred to permutations until I realized
-- that this wasn't actually doing permutations. '
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold f z [] = [z]
permfold f z (x:xs) = concat $ map (\a -> (permfold f (f z a) xs)) x

prepCmd :: (String, String) -> (String, String) -> (String, String)
prepCmd (name, cmd) (namea, cmda) =
    (name ++ "-" ++ namea, cmd ++ " " ++ cmda)

runCCmd name compileCmd = do
    res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name)
    if res == ExitSuccess
        then do system ("./" ++ name)
                return ()
        else    putStrLn $ name ++ " did not compile"

main = do
    mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions

解决方案

Roman Leschinkskiy responds:

Actually, the core looks mostly ok to me. Using unsafeIndex instead of (!) makes the program more than twice as fast (see my answer above). The program below is much faster, though (and cleaner, IMO). I suspect the remaining difference between this and the C program is due to GHC's general suckiness when it comes to floating point. The HEAD produces the best results with the NCG and -msse2

First, define a new Vec4 data type:

{-# LANGUAGE BangPatterns #-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types

-- Define a 4 element vector type
data Vec4 = Vec4 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat

Ensure we can store it in an array

instance Storable Vec4 where
  sizeOf _ = sizeOf (undefined :: CFloat) * 4
  alignment _ = alignment (undefined :: CFloat)

  {-# INLINE peek #-}
  peek p = do
             a <- peekElemOff q 0
             b <- peekElemOff q 1
             c <- peekElemOff q 2
             d <- peekElemOff q 3
             return (Vec4 a b c d)
    where
      q = castPtr p
  {-# INLINE poke #-}
  poke p (Vec4 a b c d) = do
             pokeElemOff q 0 a
             pokeElemOff q 1 b
             pokeElemOff q 2 c
             pokeElemOff q 3 d
    where
      q = castPtr p

Values and methods on this type:

a = Vec4 0.2 0.1 0.6 1.0
m = Vec4 0.99 0.7 0.8 0.6

add :: Vec4 -> Vec4 -> Vec4
{-# INLINE add #-}
add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')

mult :: Vec4 -> Vec4 -> Vec4
{-# INLINE mult #-}
mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')

vsum :: Vec4 -> CFloat
{-# INLINE vsum #-}
vsum (Vec4 a b c d) = a+b+c+d

multList :: Int -> Vector Vec4 -> Vector Vec4
multList !count !src
    | count <= 0    = src
    | otherwise     = multList (count-1) $ V.map (\v -> add (mult v m) a) src

main = do
    print $ Data.Vector.Storable.sum
          $ Data.Vector.Storable.map vsum
          $ multList repCount
          $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)

repCount, arraySize :: Int
repCount = 10000
arraySize = 20000

With ghc 6.12.1, -O2 -fasm:

  • 1.752

With ghc HEAD (june 26), -O2 -fasm -msse2

  • 1.708

This looks like the most idiomatic way to write a Vec4 array, and gets the best performance (11x faster than your original). (And this might become a benchmark for GHC's LLVM backend)

这篇关于Haskell在乘加运算上的数学表现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆