How can I get rid of `let` in Core?
Problem description
I have a function that is called frequently in an inner loop. It looks like this:
import qualified Data.Vector.Storable as SV
newtype Timedelta = Timedelta Double
cklsLogDens :: SV.Vector Double -> Timedelta -> Double -> Double -> Double
cklsLogDens p (Timedelta dt) x0 x1 = if si <= 0 then -1e50 else c - 0.5*((x1-mu)/sd)^2
  where
    al = p `SV.unsafeIndex` 0
    be = p `SV.unsafeIndex` 1
    si = p `SV.unsafeIndex` 2
    xi = p `SV.unsafeIndex` 3
    sdt = sqrt dt
    mu = x0 + (al + be*x0)*dt
    sd = si * (x0 ** xi) * sdt
    c = sd `seq` -0.5 * log (2*pi*sd^2)
(Data.Vector.Storable is used because this function needs to work on data from a C function later)
GHC has optimized this very nicely (all variables and ops are primitives as far as I can tell), but looking at the Core, there is one `let` that is still inside (what was) the body of the function. I have read here (and somewhere else I don't remember) that `let`s allocate lazy thunks and can thus be bad for performance in tight loops. Can I get rid of it? If at all possible I would prefer not to convert my function into 20 case statements, but if that is too much to ask I'll accept it.
Here is the Core:
$wloop_s4Li [Occ=LoopBreaker]
:: GHC.Prim.Double#
-> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Double#
[LclId, Arity=3, Str=DmdType LLL]
$wloop_s4Li =
\ (ww_X4OR :: GHC.Prim.Double#)
(ww1_X4OW :: GHC.Prim.Int#)
(ww2_X4P1 :: GHC.Prim.Int#) ->
case GHC.Prim.<# ww1_X4OW ww2_X4P1 of _ {
GHC.Types.False -> ww_X4OR;
GHC.Types.True ->
case GHC.Prim.<=## x_a4tg 0.0 of _ {
GHC.Types.False ->
case GHC.Prim.indexDoubleArray#
rb2_a4rT (GHC.Prim.+# rb_a4rR (GHC.Prim.-# ww1_X4OW 1))
of wild17_X4xM { __DEFAULT ->
let {
---- ^^^^ want to get rid of this!
----
----
ipv1_X2S8 [Dmd=Just L] :: GHC.Prim.Double#
[LclId, Str=DmdType]
ipv1_X2S8 =
GHC.Prim.*##
(GHC.Prim.*## x_a4tg (GHC.Prim.**## wild17_X4xM y_a3BN))
(GHC.Prim.sqrtDouble# tpl1_B3) } in
case GHC.Prim.logDouble#
(GHC.Prim.*##
6.283185307179586 (GHC.Prim.*## ipv1_X2S8 ipv1_X2S8))
of wild18_X3Gn { __DEFAULT ->
case GHC.Prim.indexDoubleArray#
rb2_a4rT (GHC.Prim.+# rb_a4rR ww1_X4OW)
of wild19_X4AY { __DEFAULT ->
case GHC.Prim./##
(GHC.Prim.-##
wild19_X4AY
(GHC.Prim.+##
wild17_X4xM
(GHC.Prim.*##
(GHC.Prim.+##
x1_X3GA (GHC.Prim.*## x2_X3cb wild17_X4xM))
tpl1_B3)))
ipv1_X2S8
of wild20_X3x8 { __DEFAULT ->
$wloop_s4Li
(GHC.Prim.+##
ww_X4OR
(GHC.Prim.-##
(GHC.Prim.negateDouble# (GHC.Prim.*## 0.5 wild18_X3Gn))
(GHC.Prim.*##
0.5 (GHC.Prim.*## wild20_X3x8 wild20_X3x8))))
(GHC.Prim.+# ww1_X4OW 1)
ww2_X4P1
}
}
}
};
GHC.Types.True ->
$wloop_s4Li
(GHC.Prim.+## ww_X4OR -1.0e50)
(GHC.Prim.+# ww1_X4OW 1)
ww2_X4P1
}
}; }
(Yes, of course, since you must ask, I am spending waaay too much time on premature optimization...)
Here is the current version with NOINLINE
import qualified Data.Vector.Storable as SV
newtype Timedelta = Timedelta Double
cklsLogDens :: SV.Vector Double -> Timedelta -> Double -> Double -> Double
{-# NOINLINE cklsLogDens #-}
cklsLogDens p (Timedelta dt) x0 x1 = si `seq` (if si <= 0 then -1e50 else (sd `seq` (c - 0.5*((x1-mu)/sd)^2)))
  where
    al = p `SV.unsafeIndex` 0
    be = p `SV.unsafeIndex` 1
    si = p `SV.unsafeIndex` 2
    xi = p `SV.unsafeIndex` 3
    sdt = sqrt dt
    mu = x0 + (al + be*x0)*dt
    sd = si * (x0 ** xi) * sdt
    c = sd `seq` (-0.5 * log (2*pi*sd^2))
main = putStrLn . show $ cklsLogDens SV.empty (Timedelta 0.1) 0.1 0.15
Corresponding Core snippet:
Main.cklsLogDens [InlPrag=NOINLINE]
:: Data.Vector.Storable.Vector GHC.Types.Double
-> Main.Timedelta
-> GHC.Types.Double
-> GHC.Types.Double
-> GHC.Types.Double
[GblId, Arity=4, Caf=NoCafRefs, Str=DmdType U(ALL)LLL]
Main.cklsLogDens =
\ (p_atw :: Data.Vector.Storable.Vector GHC.Types.Double)
(ds_dVa :: Main.Timedelta)
(x0_aty :: GHC.Types.Double)
(x1_atz :: GHC.Types.Double) ->
case p_atw
of _ { Data.Vector.Storable.Vector rb_a2ml rb1_a2mm rb2_a2mn ->
case GHC.Prim.readDoubleOffAddr#
@ GHC.Prim.RealWorld rb1_a2mm 2 GHC.Prim.realWorld#
of _ { (# s2_a2nH, x_a2nI #) ->
case GHC.Prim.touch#
@ GHC.ForeignPtr.ForeignPtrContents rb2_a2mn s2_a2nH
of _ { __DEFAULT ->
case GHC.Prim.<=## x_a2nI 0.0 of _ {
GHC.Types.False ->
case x0_aty of _ { GHC.Types.D# x2_a13d ->
case GHC.Prim.readDoubleOffAddr#
@ GHC.Prim.RealWorld rb1_a2mm 3 GHC.Prim.realWorld#
of _ { (# s1_X2oB, x3_X2oD #) ->
case GHC.Prim.touch#
@ GHC.ForeignPtr.ForeignPtrContents rb2_a2mn s1_X2oB
of _ { __DEFAULT ->
case ds_dVa
`cast` (Main.NTCo:Timedelta :: Main.Timedelta ~# GHC.Types.Double)
of _ { GHC.Types.D# x4_a13m ->
let {
--- ^^^^ want to get rid of this!
---
ipv_sYP [Dmd=Just L] :: GHC.Prim.Double#
[LclId, Str=DmdType]
ipv_sYP =
GHC.Prim.*##
(GHC.Prim.*## x_a2nI (GHC.Prim.**## x2_a13d x3_X2oD))
(GHC.Prim.sqrtDouble# x4_a13m) } in
case x1_atz of _ { GHC.Types.D# x5_X14E ->
case GHC.Prim.readDoubleOffAddr#
@ GHC.Prim.RealWorld rb1_a2mm 0 GHC.Prim.realWorld#
of _ { (# s3_X2p2, x6_X2p4 #) ->
case GHC.Prim.touch#
@ GHC.ForeignPtr.ForeignPtrContents rb2_a2mn s3_X2p2
of _ { __DEFAULT ->
case GHC.Prim.readDoubleOffAddr#
@ GHC.Prim.RealWorld rb1_a2mm 1 GHC.Prim.realWorld#
of _ { (# s4_X2pi, x7_X2pk #) ->
case GHC.Prim.touch#
@ GHC.ForeignPtr.ForeignPtrContents rb2_a2mn s4_X2pi
of _ { __DEFAULT ->
case GHC.Prim.logDouble#
(GHC.Prim.*## 6.283185307179586 (GHC.Prim.*## ipv_sYP ipv_sYP))
of wild9_a13D { __DEFAULT ->
case GHC.Prim./##
(GHC.Prim.-##
x5_X14E
(GHC.Prim.+##
x2_a13d
(GHC.Prim.*##
(GHC.Prim.+## x6_X2p4 (GHC.Prim.*## x7_X2pk x2_a13d)) x4_a13m)))
ipv_sYP
of wild10_a13O { __DEFAULT ->
GHC.Types.D#
(GHC.Prim.-##
(GHC.Prim.negateDouble# (GHC.Prim.*## 0.5 wild9_a13D))
(GHC.Prim.*## 0.5 (GHC.Prim.*## wild10_a13O wild10_a13O)))
}
}
}
}
}
}
}
}
}
}
};
GHC.Types.True -> lvl_r2v7
}
}
}
}
Solution
Using ghc-7.6.1, I get no difference between `-O` and `-O2`, and neither `seq`s nor bang patterns make a difference. The `let` remains in the Core.
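For reference, a bang-pattern variant of the kind that was tried might look like the sketch below. To keep it dependent on base only, it assumes a hypothetical `Params` record with strict fields in place of the storable vector; the name `cklsLogDensBang` is likewise made up for illustration. It is not meant as a fix, only as a concrete picture of "adding bangs".

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical stand-in for the four values the original reads from a
-- storable vector (al, be, si, xi); strict fields so GHC can unbox them.
data Params = Params !Double !Double !Double !Double

newtype Timedelta = Timedelta Double

-- Same arithmetic as cklsLogDens, with every local binding forced by a bang.
cklsLogDensBang :: Params -> Timedelta -> Double -> Double -> Double
cklsLogDensBang (Params al be si xi) (Timedelta dt) x0 x1
  | si <= 0   = -1e50
  | otherwise =
      let !sdt = sqrt dt
          !mu  = x0 + (al + be * x0) * dt
          !sd  = si * (x0 ** xi) * sdt
          !c   = -0.5 * log (2 * pi * sd ^ (2 :: Int))
      in c - 0.5 * ((x1 - mu) / sd) ^ (2 :: Int)
```

The bangs only tell GHC that the bindings are strict, which the strictness analyser had already worked out for the original, so it is plausible that the resulting Core is unchanged.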
But I doubt that the `let` is really harmful: it binds a primitive value, not a boxed one, and that value is used in three places thereafter. Besides, in the produced assembly, I can find no hint of a lazy thunk (but since my knowledge of assembly is rather limited, don't take this as gospel).
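The point about primitive values can be illustrated at the source level. The following is a minimal sketch, assuming the `MagicHash` extension and a made-up function `sdUnboxed`: a `let` whose variable has the unlifted type `Double#` cannot be a thunk, because unlifted values are never lazy, so GHC treats such a binding as strict.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Double (D#), Double#, sqrtDouble#, (*##))

-- Hypothetical example: s has type Double#, so this let is a strict
-- binding of a raw machine double, used twice afterwards.
sdUnboxed :: Double -> Double -> Double
sdUnboxed (D# a) (D# b) =
  let s :: Double#
      s = a *## sqrtDouble# b
  in D# (s *## s)
```

The `ipv_sYP :: GHC.Prim.Double#` binding in the Core above is of the same kind, which is consistent with no thunk showing up in the assembly.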
I can get rid of the `let` by introducing a case branch:
cklsLogDens p (Timedelta dt) x0 x1
  = case p `SV.unsafeIndex` 2 of
      si | si <= 0 -> -1e50
         | otherwise ->
             let al = p `SV.unsafeIndex` 0
                 be = p `SV.unsafeIndex` 1
                 xi = p `SV.unsafeIndex` 3
                 sdt = sqrt dt
                 mu = x0 + (al + be*x0)*dt
             in case si*(x0**xi)*sdt of
                  0 -> 0
                  sd -> -0.5*log (2*pi*sd^2) - 0.5*((x1-mu)/sd)^2
which only produces `case`s in the Core. Since `sd` should never be 0, in a loop, even a mediocre branch predictor should make that branch essentially free.
However, I doubt whether that would actually improve performance. The comparison to 0 costs a register, whereas the assembly produced by the original needs less indirect addressing and can keep more values in registers when they are needed.