Haskell在乘加运算上的数学表现 [英] Haskell math performance on multiply-add operation

查看：154 发布时间：2018/6/4 15:48:26 performance math haskell simd

本文介绍了Haskell在乘加运算上的数学表现的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在Haskell编写一个游戏，我在UI中的当前传递涉及很多程序生成的几何。我目前专注于确定一个特定操作（C-ish伪代码）的性能：

  Vec4f乘数，加数; 
 Vec4f vecList []; 
 for（int i = 0; i< count; i ++）
 vecList [i] = vecList [i] * multiplier + addend;

也就是说，四个浮点数的标准乘法加法，这是SIMD优化。

结果会传送到OpenGL顶点缓冲区，因此最终必须将其转储到平面C数组中。出于同样的原因，计算应该可以在C'float'类型上完成。

我已经寻找了一个库或一个原生的惯用解决方案来完成这种排序在Haskell中很快，但是我提出的每个解决方案似乎都在性能的2％左右（即慢50倍），而GCC的C则是正确的。诚然，我几个星期前从Haskell开始，所以我的经验是有限的 - 这就是为什么我要来找你们。你们中的任何一个人可以为更快的Haskell实现提供建议，或者提供有关如何编写高性能Haskell代码的文档指南？首先，最新的Haskell解决方案（时钟约12秒）。我尝试了这个SO贴子中的爆炸模式，但它没有使差异AFAICT。用'（\iv-> v * 4）替换'multAdd'使执行时间缩短到1.9秒，所以逐位填充（以及随之而来的自动优化挑战）似乎没有太多错误。

  { - ＃LANGUAGE BangPatterns＃ - } 
 { - ＃OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess -precision -optc-march = native＃ - } 
 
导入Data.Vector.Storable 
将限定的Data.Vector.Storable导入为V 
导入Foreign.C.Types 
 import Data.Bits 
 
 repCount = 10000 
 arraySize = 20000 
 
a = fromList $ [0.2 :: CFloat，0.1,0.6,1.0] 
m = fromList $ [0.99 :: CFloat，0.7,0.8,0.6] 
 
 multAdd :: Int  - > CFloat  - > CFloat 
 multAdd！i！v = v *（m！（i。&。3））+（a！（i。&。3））
 
 multList :: Int - > Vector CFloat  - > Vector CFloat 
 multList！count！src 
 |计数<= 0 = src 
 |否则= multList（count-1）$ V.imap multAdd src 
 $ b $ main = do 
 print $ Data.Vector.Storable.sum $ multList repCount $ 
 Data.Vector。 Storable.replicate（arraySize * 4）（0 :: CFloat）

下面是我在C中的内容。这里的代码有几个#ifdefs，它可以防止它被直接编译;向下滚动测试驱动程序。

#include< stdlib.h>
#include< time.h>

typedef float v4fs __attribute__（（vector_size（16）））;
typedef struct {float x，y，z，w; } Vector4;

void setv4（v4fs * v，float x，float y，float z，float w）{
float * a =（float *）v;
a [0] = x;
a [1] = y;
a [2] = z;
a [3] = w;
}

float sumv4（v4fs * v）{
float * a =（float *）v;
返回a [0] + a [1] + a [2] + a [3];
}

void vecmult（v4fs * MAYBE_RESTRICT s，v4fs * MAYBE_RESTRICT d，v4fs a，v4fs m）{
for（int j = 0; j d [j] = s [j] * m + a;

$ b无效scamult（float * MAYBE_RESTRICT s，float * MAYBE_RESTRICT d，
Vector4 a，Vector4 m）{
for（int j = 0; j <（N * 4）; j + = 4）{
d [j + 0] = s [j + 0] * mx + ax;
d [j + 1] = s [j + 1] * m.y + a.y;
d [j + 2] = s [j + 2] * m.z + a.z;
d [j + 3] = s [j + 3] * m.w + a.w;
}
}

int main（）{
v4fs a，m;
v4fs * s，* d;

setv4（& a，0.2，0.1，0.6，1.0）;
setv4（& m，0.99,0.7,0.8,0.6）;

s = calloc（N，sizeof（v4fs））;
d = s;

double start = clock（）;
for（int i = 0; i
#ifdef COPY
d = malloc（N * sizeof（v4fs））;
#endif

#ifdef VECTOR
vecmult（s，d，a，m）;
#else
Vector4 aa = *（Vector4 *）（& a）;
Vector4 mm = *（Vector4 *）（& m）;
scamult（（float *）s，（float *）d，aa，mm）;
#endif

#ifdef COPY
free（s）;
s = d;
#endif
}
double end = clock（）;

float sum = 0;
for（int j = 0; j sum + = sumv4（s + j）;
}
printf（％ - 50s％2.5f％f\\\
\\\
，NAME，
（end - start）/（double）CLOCKS_PER_SEC，sum）;
}

该脚本将编译并运行多个gcc标志组合的测试。在我的系统上，cmath-64-native-O3-restrict-vector-nocopy的性能最佳，耗时0.22秒。

  import System.Process 
 import GHC.IOBase 
 
 cBase =（cmath，gcc mult.c -ggdb --std = c99 -DM = 10000 -DN = 20000） 
 cOptions = [
 [（32，-m32），（64，-m64）]，
 [（generic，），（ native，-march = native -msse4）]，
 [（O1，-O1），（O2，-O2），（O3，-O3 ）]，
 [（restrict，-DMAYBE_RESTRICT = __ restrict__），
（norestrict，-DMAYBE_RESTRICT =）]，
 [（vector， -DVECTOR），（scalar，）]，
 [（copy，-DCOPY），（nocopy，）] 
] 
 
  - 折叠双列表的笛卡尔乘积。可能是一个前奏功能
  - 或者两个这样做，但嘿。 '烫发'提到排列，直到我意识到
  - 这实际上并没有做排列。 '
 permfold ::（a  - > a  - > a） - > a  - > [[a]]  - > [a] 
 permfold fz [] = [z] 
 permfold fz（x：xs）= concat $ map（\a-  - >（permfold f（fza）xs））x 
 
 prepCmd ::（String，String） - > （String，String） - > （String，String）
 prepCmd（name，cmd）（namea，cmda）= 
（name ++ - ++ namea，cmd ++++ cmda）
 
 runCCmd name compileCmd = do 
 res<  -  system（compileCmd ++-DNAME = \\\\++ name ++\\\-o+ + name）
 if res == ExitSuccess 
然后执行system（./++ name）
 return（）
 else putStrLn $ name ++did not compile 
 
 main = do 
 mapM_（uncurry runCCmd）$ permfold prepCmd cBase cOptions

实际上，核心看起来大致可以确定为
我。使用unsafeIndex而不是（！）
使程序以
快两倍以上（请参阅上面的答案）。下面的
程序要快得多，但是
（和更干净的IMO）。我怀疑C计划与b $ b之间的剩余
差额是由于GHC的浮动
点的一般
吮吸。 HEAD通过NCG和产生
的最佳结果首先，定义一个新的Vec4数据类型：

{ - ＃LANGUAGE BangPatterns＃ - } 导入Data.Vector.Storable 将限定的Data.Vector.Storable导入为V import Foreign import Foreign.C.Types - 定义一个4元素的向量类型 data Vec4 = Vec4 { - ＃UNPACK＃ - }！CFloat { - ＃UNPACK＃ - }！CFloat { - ＃UNPACK＃ - }！CFloat { - ＃UNPACK＃ - }！CFloat
确保我们可以将它存储在一个数组中
instance Storable Vec4 where sizeOf _ = sizeOf（undefined :: CFloat）* 4 alignment _ = alignment（undefined :: CFloat） { - ＃INLINE peek＃ - } peek p = do a< - peekElemOff q 0 b< - peekElemOff q 1 c< - peekElemOff q 2 d< - peekElemOff q 3 return（Vec4 abcd）其中 q = castPtr p { - ＃INLINE poke＃ - } poke p（Vec4 abcd）= do pokeElemOff q 0 a pokeElemOff q 1 b pokeElemOff q 2 c pokeElemOff q 3 d 其中 q = castPtr p
类型：

a = Vec4 0.2 0.1 0.6 1.0 m = Vec4 0.99 0.7 0.8 0.6 add :: Vec4 - > Vec4 - > Vec4 { - ＃INLINE add＃ - } add（Vec4 abcd）（Vec4 a'b'c'd'）= Vec4（a + a'）（b + b'）（c + c'）（d + d'） mult :: Vec4 - > Vec4 - > Vec4 { - #INLINE mult＃ - } mult（Vec4 abcd）（Vec4 a'b'c'd'）= Vec4（a * a'）（b * b'）（c * c'）（d * d'） vsum :: Vec4 - > CFloat { - #INLINE vsum＃ - } vsum（Vec4 a b c d）= a + b + c + d multList :: Int - > Vector Vec4 - > Vector Vec4 multList！count！src |计数<= 0 = src |否则= multList（count-1）$ V.map（\v-> add（mult vm）a）src main = do print $ Data.Vector.Storable。 sum $ Data.Vector.Storable.map vsum $ multList repCount $ Data.Vector.Storable.replicate arraySize（Vec4 0 0 0 0） repCount ，arraySize :: Int repCount = 10000 arraySize = 20000
使用ghc 6.12.1，-O2 -fasm：

<1.75>

使用ghc HEAD（6月26日），-O2 -fasm -msse2

1.708

这看起来是编写Vec4阵列最习惯的方式，并且性能最佳（比原来快11倍）。（这可能会成为GHC的LLVM后端的基准）
I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one particular operation (C-ish pseudocode):
Vec4f multiplier, addend; Vec4f vecList[]; for (int i = 0; i < count; i++) vecList[i] = vecList[i] * multiplier + addend;
That is, a bog-standard multiply-add of four floats, the kind of thing ripe for SIMD optimization.

The result is going to an OpenGL vertex buffer, so it has to get dumped into a flat C array eventually. For the same reason, the calculations should probably be done on C 'float' types.

I've looked for either a library or a native idiomatic solution to do this sort of thing quickly in Haskell, but every solution I've come up with seems to hover around 2% of the performance (that is, 50x slower) compared to C from GCC with the right flags. Granted, I started with Haskell a couple weeks ago, so my experience is limited—which is why I'm coming to you guys. Can any of you offer suggestions for a faster Haskell implementation, or pointers to documentation on how to write high-performance Haskell code?
First, the most recent Haskell solution (clocks about 12 seconds). I tried the bang-patterns stuff from this SO post, but it didn't make a difference AFAICT. Replacing 'multAdd' with '(\i v -> v * 4)' brought execution time down to 1.9 seconds, so the bitwise stuff (and consequent challenges to automatic optimization) doesn't seem to be too much at fault.
{-# LANGUAGE BangPatterns #-} {-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-} import Data.Vector.Storable import qualified Data.Vector.Storable as V import Foreign.C.Types import Data.Bits repCount = 10000 arraySize = 20000 a = fromList $ [0.2::CFloat, 0.1, 0.6, 1.0] m = fromList $ [0.99::CFloat, 0.7, 0.8, 0.6] multAdd :: Int -> CFloat -> CFloat multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3)) multList :: Int -> Vector CFloat -> Vector CFloat multList !count !src | count <= 0 = src | otherwise = multList (count-1) $ V.imap multAdd src main = do print $ Data.Vector.Storable.sum $ multList repCount $ Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)
Here's what I have in C. The code here has a few #ifdefs which prevents it from being compiled straight-up; scroll down for the test driver.
#include <stdio.h> #include <stdlib.h> #include <time.h> typedef float v4fs __attribute__ ((vector_size (16))); typedef struct { float x, y, z, w; } Vector4; void setv4(v4fs *v, float x, float y, float z, float w) { float *a = (float*) v; a[0] = x; a[1] = y; a[2] = z; a[3] = w; } float sumv4(v4fs *v) { float *a = (float*) v; return a[0] + a[1] + a[2] + a[3]; } void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) { for (int j = 0; j < N; j++) { d[j] = s[j] * m + a; } } void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d, Vector4 a, Vector4 m) { for (int j = 0; j < (N*4); j+=4) { d[j+0] = s[j+0] * m.x + a.x; d[j+1] = s[j+1] * m.y + a.y; d[j+2] = s[j+2] * m.z + a.z; d[j+3] = s[j+3] * m.w + a.w; } } int main () { v4fs a, m; v4fs *s, *d; setv4(&a, 0.2, 0.1, 0.6, 1.0); setv4(&m, 0.99, 0.7, 0.8, 0.6); s = calloc(N, sizeof(v4fs)); d = s; double start = clock(); for (int i = 0; i < M; i++) { #ifdef COPY d = malloc(N * sizeof(v4fs)); #endif #ifdef VECTOR vecmult(s, d, a, m); #else Vector4 aa = *(Vector4*)(&a); Vector4 mm = *(Vector4*)(&m); scamult((float*)s, (float*)d, aa, mm); #endif #ifdef COPY free(s); s = d; #endif } double end = clock(); float sum = 0; for (int j = 0; j < N; j++) { sum += sumv4(s+j); } printf("%-50s %2.5f %f\n\n", NAME, (end - start) / (double) CLOCKS_PER_SEC, sum); }
This script will compile and run the tests with a number of gcc flag combinations. The best performance was had by cmath-64-native-O3-restrict-vector-nocopy on my system, taking 0.22 seconds.
import System.Process import GHC.IOBase cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000") cOptions = [ [("32", "-m32"), ("64", "-m64")], [("generic", ""), ("native", "-march=native -msse4")], [("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")], [("restrict", "-DMAYBE_RESTRICT=__restrict__"), ("norestrict", "-DMAYBE_RESTRICT=")], [("vector", "-DVECTOR"), ("scalar", "")], [("copy", "-DCOPY"), ("nocopy", "")] ] -- Fold over the Cartesian product of the double list. Probably a Prelude function -- or two that does this, but hey. The 'perm' referred to permutations until I realized -- that this wasn't actually doing permutations. ' permfold :: (a -> a -> a) -> a -> [[a]] -> [a] permfold f z [] = [z] permfold f z (x:xs) = concat $ map (\a -> (permfold f (f z a) xs)) x prepCmd :: (String, String) -> (String, String) -> (String, String) prepCmd (name, cmd) (namea, cmda) = (name ++ "-" ++ namea, cmd ++ " " ++ cmda) runCCmd name compileCmd = do res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name) if res == ExitSuccess then do system ("./" ++ name) return () else putStrLn $ name ++ " did not compile" main = do mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions

解决方案
Roman Leschinkskiy responds:

Actually, the core looks mostly ok to me. Using unsafeIndex instead of (!) makes the program more than twice as fast (see my answer above). The program below is much faster, though (and cleaner, IMO). I suspect the remaining difference between this and the C program is due to GHC's general suckiness when it comes to floating point. The HEAD produces the best results with the NCG and -msse2

First, define a new Vec4 data type:
{-# LANGUAGE BangPatterns #-} import Data.Vector.Storable import qualified Data.Vector.Storable as V import Foreign import Foreign.C.Types -- Define a 4 element vector type data Vec4 = Vec4 {-# UNPACK #-} !CFloat {-# UNPACK #-} !CFloat {-# UNPACK #-} !CFloat {-# UNPACK #-} !CFloat
Ensure we can store it in an array
instance Storable Vec4 where sizeOf _ = sizeOf (undefined :: CFloat) * 4 alignment _ = alignment (undefined :: CFloat) {-# INLINE peek #-} peek p = do a <- peekElemOff q 0 b <- peekElemOff q 1 c <- peekElemOff q 2 d <- peekElemOff q 3 return (Vec4 a b c d) where q = castPtr p {-# INLINE poke #-} poke p (Vec4 a b c d) = do pokeElemOff q 0 a pokeElemOff q 1 b pokeElemOff q 2 c pokeElemOff q 3 d where q = castPtr p
Values and methods on this type:
a = Vec4 0.2 0.1 0.6 1.0 m = Vec4 0.99 0.7 0.8 0.6 add :: Vec4 -> Vec4 -> Vec4 {-# INLINE add #-} add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d') mult :: Vec4 -> Vec4 -> Vec4 {-# INLINE mult #-} mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d') vsum :: Vec4 -> CFloat {-# INLINE vsum #-} vsum (Vec4 a b c d) = a+b+c+d multList :: Int -> Vector Vec4 -> Vector Vec4 multList !count !src | count <= 0 = src | otherwise = multList (count-1) $ V.map (\v -> add (mult v m) a) src main = do print $ Data.Vector.Storable.sum $ Data.Vector.Storable.map vsum $ multList repCount $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0) repCount, arraySize :: Int repCount = 10000 arraySize = 20000
With ghc 6.12.1, -O2 -fasm:

1.752

With ghc HEAD (june 26), -O2 -fasm -msse2

1.708

This looks like the most idiomatic way to write a Vec4 array, and gets the best performance (11x faster than your original). (And this might become a benchmark for GHC's LLVM backend)

这篇关于Haskell在乘加运算上的数学表现的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

Haskell在乘加运算上的数学表现 [英] Haskell math performance on multiply-add operation

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

Haskell在乘加运算上的数学表现 [英] Haskell math performance on multiply-add operation

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭