Haskell在乘加运算上的数学表现 [英] Haskell math performance on multiply-add operation
问题描述
Vec4f乘数,加数;
Vec4f vecList [];
for(int i = 0; i< count; i ++)
vecList [i] = vecList [i] * multiplier + addend;
也就是说,四个浮点数的标准乘法加法,这是SIMD优化。
结果会传送到OpenGL顶点缓冲区,因此最终必须将其转储到平面C数组中。出于同样的原因,计算应该可以在C'float'类型上完成。
我已经寻找了一个库或一个原生的惯用解决方案来完成这种排序在Haskell中很快,但是我提出的每个解决方案似乎都在性能的2%左右(即慢50倍),而GCC的C则是正确的。诚然,我几个星期前从Haskell开始,所以我的经验是有限的 - 这就是为什么我要来找你们。你们中的任何一个人可以为更快的Haskell实现提供建议,或者提供有关如何编写高性能Haskell代码的文档指南?首先,最新的Haskell解决方案(时钟约12秒)。我尝试了这个SO贴子中的爆炸模式,但它没有使差异AFAICT。用'(\iv-> v * 4)替换'multAdd'使执行时间缩短到1.9秒,所以逐位填充(以及随之而来的自动优化挑战)似乎没有太多错误。
{ - #LANGUAGE BangPatterns# - }
{ - #OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess -precision -optc-march = native# - }
导入Data.Vector.Storable
将限定的Data.Vector.Storable导入为V
导入Foreign.C.Types
import Data.Bits
repCount = 10000
arraySize = 20000
a = fromList $ [0.2 :: CFloat,0.1,0.6,1.0]
m = fromList $ [0.99 :: CFloat,0.7,0.8,0.6]
multAdd :: Int - > CFloat - > CFloat
multAdd!i!v = v *(m!(i。&。3))+(a!(i。&。3))
multList :: Int - > Vector CFloat - > Vector CFloat
multList!count!src
|计数<= 0 = src
|否则= multList(count-1)$ V.imap multAdd src
$ b $ main = do
print $ Data.Vector.Storable.sum $ multList repCount $
Data.Vector。 Storable.replicate(arraySize * 4)(0 :: CFloat)
下面是我在C中的内容。这里的代码有几个#ifdefs,它可以防止它被直接编译;向下滚动测试驱动程序。
#include< stdlib.h>
#include< time.h>
typedef float v4fs __attribute__((vector_size(16)));
typedef struct {float x,y,z,w; } Vector4;
void setv4(v4fs * v,float x,float y,float z,float w){
float * a =(float *)v;
a [0] = x;
a [1] = y;
a [2] = z;
a [3] = w;
}
float sumv4(v4fs * v){
float * a =(float *)v;
返回a [0] + a [1] + a [2] + a [3];
}
void vecmult(v4fs * MAYBE_RESTRICT s,v4fs * MAYBE_RESTRICT d,v4fs a,v4fs m){
for(int j = 0; j
$ b无效scamult(float * MAYBE_RESTRICT s,float * MAYBE_RESTRICT d,
Vector4 a,Vector4 m){
for(int j = 0; j <(N * 4); j + = 4){
d [j + 0] = s [j + 0] * mx + ax;
d [j + 1] = s [j + 1] * m.y + a.y;
d [j + 2] = s [j + 2] * m.z + a.z;
d [j + 3] = s [j + 3] * m.w + a.w;
}
}
int main(){
v4fs a,m;
v4fs * s,* d;
setv4(& a,0.2,0.1,0.6,1.0);
setv4(& m,0.99,0.7,0.8,0.6);
s = calloc(N,sizeof(v4fs));
d = s;
double start = clock();
for(int i = 0; i
#ifdef COPY
d = malloc(N * sizeof(v4fs));
#endif
#ifdef VECTOR
vecmult(s,d,a,m);
#else
Vector4 aa = *(Vector4 *)(& a);
Vector4 mm = *(Vector4 *)(& m);
scamult((float *)s,(float *)d,aa,mm);
#endif
#ifdef COPY
free(s);
s = d;
#endif
}
double end = clock();
float sum = 0;
for(int j = 0; j
}
printf(% - 50s%2.5f%f\\\
\\\
,NAME,
(end - start)/(double)CLOCKS_PER_SEC,sum);
}
该脚本将编译并运行多个gcc标志组合的测试。在我的系统上,cmath-64-native-O3-restrict-vector-nocopy的性能最佳,耗时0.22秒。
import System.Process
import GHC.IOBase
cBase =(cmath,gcc mult.c -ggdb --std = c99 -DM = 10000 -DN = 20000)
cOptions = [
[(32,-m32),(64,-m64)],
[(generic,),( native,-march = native -msse4)],
[(O1,-O1),(O2,-O2),(O3,-O3 )],
[(restrict,-DMAYBE_RESTRICT = __ restrict__),
(norestrict,-DMAYBE_RESTRICT =)],
[(vector, -DVECTOR),(scalar,)],
[(copy,-DCOPY),(nocopy,)]
]
- 折叠双列表的笛卡尔乘积。可能是一个前奏功能
- 或者两个这样做,但嘿。 '烫发'提到排列,直到我意识到
- 这实际上并没有做排列。 '
permfold ::(a - > a - > a) - > a - > [[a]] - > [a]
permfold fz [] = [z]
permfold fz(x:xs)= concat $ map(\a- - >(permfold f(fza)xs))x
prepCmd ::(String,String) - > (String,String) - > (String,String)
prepCmd(name,cmd)(namea,cmda)=
(name ++ - ++ namea,cmd ++++ cmda)
runCCmd name compileCmd = do
res< - system(compileCmd ++-DNAME = \\\\++ name ++\\\-o+ + name)
if res == ExitSuccess
然后执行system(./++ name)
return()
else putStrLn $ name ++did not compile
main = do
mapM_(uncurry runCCmd)$ permfold prepCmd cBase cOptions
实际上,核心看起来大致可以确定为
我。使用unsafeIndex而不是(!)
使程序以
快两倍以上(请参阅上面的答案)。下面的
程序要快得多,但是
(和更干净的IMO)。我怀疑C计划与b $ b之间的剩余
差额是由于GHC的浮动
点的一般
吮吸。 HEAD通过NCG和产生
的最佳结果首先,定义一个新的Vec4数据类型:
{ - #LANGUAGE BangPatterns# - }
导入Data.Vector.Storable
将限定的Data.Vector.Storable导入为V
import Foreign
import Foreign.C.Types
- 定义一个4元素的向量类型
data Vec4 = Vec4 { - #UNPACK# - }!CFloat
{ - #UNPACK# - }!CFloat
{ - #UNPACK# - }!CFloat
{ - #UNPACK# - }!CFloat
确保我们可以将它存储在一个数组中
instance Storable Vec4 where
sizeOf _ = sizeOf(undefined :: CFloat)* 4
alignment _ = alignment(undefined :: CFloat)
{ - #INLINE peek# - }
peek p = do
a< - peekElemOff q 0
b< - peekElemOff q 1
c< - peekElemOff q 2
d< - peekElemOff q 3
return(Vec4 abcd)
其中
q = castPtr p
{ - #INLINE poke# - }
poke p(Vec4 abcd)= do
pokeElemOff q 0 a
pokeElemOff q 1 b
pokeElemOff q 2 c
pokeElemOff q 3 d
其中
q = castPtr p
类型:
a = Vec4 0.2 0.1 0.6 1.0
m = Vec4 0.99 0.7 0.8 0.6
add :: Vec4 - > Vec4 - > Vec4
{ - #INLINE add# - }
add(Vec4 abcd)(Vec4 a'b'c'd')= Vec4(a + a')(b + b')(c + c')(d + d')
mult :: Vec4 - > Vec4 - > Vec4
{ - #INLINE mult# - }
mult(Vec4 abcd)(Vec4 a'b'c'd')= Vec4(a * a')(b * b')(c * c')(d * d')
vsum :: Vec4 - > CFloat
{ - #INLINE vsum# - }
vsum(Vec4 a b c d)= a + b + c + d
multList :: Int - > Vector Vec4 - > Vector Vec4
multList!count!src
|计数<= 0 = src
|否则= multList(count-1)$ V.map(\v-> add(mult vm)a)src
main = do
print $ Data.Vector.Storable。 sum
$ Data.Vector.Storable.map vsum
$ multList repCount
$ Data.Vector.Storable.replicate arraySize(Vec4 0 0 0 0)
repCount ,arraySize :: Int
repCount = 10000
arraySize = 20000
使用ghc 6.12.1,-O2 -fasm:
<1.75>
使用ghc HEAD(6月26日),-O2 -fasm -msse2
- 1.708
这看起来是编写Vec4阵列最习惯的方式,并且性能最佳(比原来快11倍)。 (这可能会成为GHC的LLVM后端的基准)
I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one particular operation (C-ish pseudocode):
Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
vecList[i] = vecList[i] * multiplier + addend;
That is, a bog-standard multiply-add of four floats, the kind of thing ripe for SIMD optimization.
The result is going to an OpenGL vertex buffer, so it has to get dumped into a flat C array eventually. For the same reason, the calculations should probably be done on C 'float' types.
I've looked for either a library or a native idiomatic solution to do this sort of thing quickly in Haskell, but every solution I've come up with seems to hover around 2% of the performance (that is, 50x slower) compared to C from GCC with the right flags. Granted, I started with Haskell a couple weeks ago, so my experience is limited—which is why I'm coming to you guys. Can any of you offer suggestions for a faster Haskell implementation, or pointers to documentation on how to write high-performance Haskell code?
First, the most recent Haskell solution (clocks about 12 seconds). I tried the bang-patterns stuff from this SO post, but it didn't make a difference AFAICT. Replacing 'multAdd' with '(\i v -> v * 4)' brought execution time down to 1.9 seconds, so the bitwise stuff (and consequent challenges to automatic optimization) doesn't seem to be too much at fault.
{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}
import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign.C.Types
import Data.Bits
repCount = 10000
arraySize = 20000
a = fromList $ [0.2::CFloat, 0.1, 0.6, 1.0]
m = fromList $ [0.99::CFloat, 0.7, 0.8, 0.6]
multAdd :: Int -> CFloat -> CFloat
multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3))
multList :: Int -> Vector CFloat -> Vector CFloat
multList !count !src
| count <= 0 = src
| otherwise = multList (count-1) $ V.imap multAdd src
main = do
print $ Data.Vector.Storable.sum $ multList repCount $
Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)
Here's what I have in C. The code here has a few #ifdefs which prevents it from being compiled straight-up; scroll down for the test driver.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef float v4fs __attribute__ ((vector_size (16)));
typedef struct { float x, y, z, w; } Vector4;
void setv4(v4fs *v, float x, float y, float z, float w) {
float *a = (float*) v;
a[0] = x;
a[1] = y;
a[2] = z;
a[3] = w;
}
float sumv4(v4fs *v) {
float *a = (float*) v;
return a[0] + a[1] + a[2] + a[3];
}
void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) {
for (int j = 0; j < N; j++) {
d[j] = s[j] * m + a;
}
}
void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d,
Vector4 a, Vector4 m) {
for (int j = 0; j < (N*4); j+=4) {
d[j+0] = s[j+0] * m.x + a.x;
d[j+1] = s[j+1] * m.y + a.y;
d[j+2] = s[j+2] * m.z + a.z;
d[j+3] = s[j+3] * m.w + a.w;
}
}
int main () {
v4fs a, m;
v4fs *s, *d;
setv4(&a, 0.2, 0.1, 0.6, 1.0);
setv4(&m, 0.99, 0.7, 0.8, 0.6);
s = calloc(N, sizeof(v4fs));
d = s;
double start = clock();
for (int i = 0; i < M; i++) {
#ifdef COPY
d = malloc(N * sizeof(v4fs));
#endif
#ifdef VECTOR
vecmult(s, d, a, m);
#else
Vector4 aa = *(Vector4*)(&a);
Vector4 mm = *(Vector4*)(&m);
scamult((float*)s, (float*)d, aa, mm);
#endif
#ifdef COPY
free(s);
s = d;
#endif
}
double end = clock();
float sum = 0;
for (int j = 0; j < N; j++) {
sum += sumv4(s+j);
}
printf("%-50s %2.5f %f\n\n", NAME,
(end - start) / (double) CLOCKS_PER_SEC, sum);
}
This script will compile and run the tests with a number of gcc flag combinations. The best performance was had by cmath-64-native-O3-restrict-vector-nocopy on my system, taking 0.22 seconds.
import System.Process
import GHC.IOBase
cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000")
cOptions = [
[("32", "-m32"), ("64", "-m64")],
[("generic", ""), ("native", "-march=native -msse4")],
[("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")],
[("restrict", "-DMAYBE_RESTRICT=__restrict__"),
("norestrict", "-DMAYBE_RESTRICT=")],
[("vector", "-DVECTOR"), ("scalar", "")],
[("copy", "-DCOPY"), ("nocopy", "")]
]
-- Fold over the Cartesian product of the double list. Probably a Prelude function
-- or two that does this, but hey. The 'perm' referred to permutations until I realized
-- that this wasn't actually doing permutations. '
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold f z [] = [z]
permfold f z (x:xs) = concat $ map (\a -> (permfold f (f z a) xs)) x
prepCmd :: (String, String) -> (String, String) -> (String, String)
prepCmd (name, cmd) (namea, cmda) =
(name ++ "-" ++ namea, cmd ++ " " ++ cmda)
runCCmd name compileCmd = do
res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name)
if res == ExitSuccess
then do system ("./" ++ name)
return ()
else putStrLn $ name ++ " did not compile"
main = do
mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions
Roman Leschinkskiy responds:
Actually, the core looks mostly ok to me. Using unsafeIndex instead of (!) makes the program more than twice as fast (see my answer above). The program below is much faster, though (and cleaner, IMO). I suspect the remaining difference between this and the C program is due to GHC's general suckiness when it comes to floating point. The HEAD produces the best results with the NCG and -msse2
First, define a new Vec4 data type:
{-# LANGUAGE BangPatterns #-}
import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types
-- Define a 4 element vector type
data Vec4 = Vec4 {-# UNPACK #-} !CFloat
{-# UNPACK #-} !CFloat
{-# UNPACK #-} !CFloat
{-# UNPACK #-} !CFloat
Ensure we can store it in an array
instance Storable Vec4 where
sizeOf _ = sizeOf (undefined :: CFloat) * 4
alignment _ = alignment (undefined :: CFloat)
{-# INLINE peek #-}
peek p = do
a <- peekElemOff q 0
b <- peekElemOff q 1
c <- peekElemOff q 2
d <- peekElemOff q 3
return (Vec4 a b c d)
where
q = castPtr p
{-# INLINE poke #-}
poke p (Vec4 a b c d) = do
pokeElemOff q 0 a
pokeElemOff q 1 b
pokeElemOff q 2 c
pokeElemOff q 3 d
where
q = castPtr p
Values and methods on this type:
a = Vec4 0.2 0.1 0.6 1.0
m = Vec4 0.99 0.7 0.8 0.6
add :: Vec4 -> Vec4 -> Vec4
{-# INLINE add #-}
add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')
mult :: Vec4 -> Vec4 -> Vec4
{-# INLINE mult #-}
mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')
vsum :: Vec4 -> CFloat
{-# INLINE vsum #-}
vsum (Vec4 a b c d) = a+b+c+d
multList :: Int -> Vector Vec4 -> Vector Vec4
multList !count !src
| count <= 0 = src
| otherwise = multList (count-1) $ V.map (\v -> add (mult v m) a) src
main = do
print $ Data.Vector.Storable.sum
$ Data.Vector.Storable.map vsum
$ multList repCount
$ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)
repCount, arraySize :: Int
repCount = 10000
arraySize = 20000
With ghc 6.12.1, -O2 -fasm:
- 1.752
With ghc HEAD (june 26), -O2 -fasm -msse2
- 1.708
This looks like the most idiomatic way to write a Vec4 array, and gets the best performance (11x faster than your original). (And this might become a benchmark for GHC's LLVM backend)
这篇关于Haskell在乘加运算上的数学表现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!