How can I get rid of `let` in Core?

Views: 98 | Published: 2018/6/5 10:59:32 | Tags: optimization haskell core


Problem description

I have a function that is called frequently in an inner loop. It looks like this:
import qualified Data.Vector.Storable as SV

newtype Timedelta = Timedelta Double

cklsLogDens :: SV.Vector Double -> Timedelta -> Double -> Double -> Double
cklsLogDens p (Timedelta dt) x0 x1 = if si <= 0 then -1e50 else c - 0.5*((x1-mu)/sd)^2 
  where
    al  = p `SV.unsafeIndex` 0
    be  = p `SV.unsafeIndex` 1
    si  = p `SV.unsafeIndex` 2
    xi  = p `SV.unsafeIndex` 3
    sdt = sqrt dt
    mu  = x0 + (al + be*x0)*dt
    sd  = si * (x0 ** xi) * sdt
    c   = sd `seq` -0.5 * log (2*pi*sd^2)
(Data.Vector.Storable is used because this function needs to work on data from a C function later)

GHC has optimized this very nicely (all variables and operations are primitives, as far as I can tell), but looking at the Core, there is one let still inside (what was) the body of the function. I have read here (and somewhere else I don't remember) that lets allocate lazy thunks and can thus be bad for performance in tight loops. Can I get rid of it? If at all possible I would prefer not to convert my function into 20 case statements, but if that is too much to ask I'll accept it.
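For readers who want to reproduce a dump like the one below, here is a minimal sketch of one way to ask GHC for Core. The function `logDens` is a hypothetical stand-in, not the code from the question. The `-ddump-simpl`, `-dsuppress-*` and `-ddump-to-file` flags are standard GHC options, though the exact output format differs between GHC versions.

```haskell
{-# OPTIONS_GHC -O2 -ddump-simpl -dsuppress-all -dsuppress-uniques -ddump-to-file #-}
-- -ddump-simpl       : dump the Core produced by the simplifier
-- -dsuppress-all     : hide most suppressible detail (module prefixes, IdInfo, ...)
-- -dsuppress-uniques : drop the unique suffixes attached to names
-- -ddump-to-file     : write the dump to a file instead of the terminal

-- Hypothetical stand-in for the function under inspection.
logDens :: Double -> Double -> Double -> Double
logDens sd mu x = c - 0.5 * ((x - mu) / sd) ^ (2 :: Int)
  where
    c = -0.5 * log (2 * pi * sd * sd)
```

Compiled as part of a program, this should leave a `.dump-simpl` file next to the source; searching it for `let` shows which bindings survived optimization.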

Here is the Core:
$wloop_s4Li [Occ=LoopBreaker]
  :: GHC.Prim.Double#
     -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Double#
[LclId, Arity=3, Str=DmdType LLL]
$wloop_s4Li =
  \ (ww_X4OR :: GHC.Prim.Double#)
    (ww1_X4OW :: GHC.Prim.Int#)
    (ww2_X4P1 :: GHC.Prim.Int#) ->
    case GHC.Prim.<# ww1_X4OW ww2_X4P1 of _ {
      GHC.Types.False -> ww_X4OR;
      GHC.Types.True ->
        case GHC.Prim.<=## x_a4tg 0.0 of _ {
          GHC.Types.False ->
            case GHC.Prim.indexDoubleArray#
                   rb2_a4rT (GHC.Prim.+# rb_a4rR (GHC.Prim.-# ww1_X4OW 1))
            of wild17_X4xM { __DEFAULT ->

            let {
      ----  ^^^^ want to get rid of this! 
      ----
      ----
              ipv1_X2S8 [Dmd=Just L] :: GHC.Prim.Double#
              [LclId, Str=DmdType]
              ipv1_X2S8 =
                GHC.Prim.*##
                  (GHC.Prim.*## x_a4tg (GHC.Prim.**## wild17_X4xM y_a3BN))
                  (GHC.Prim.sqrtDouble# tpl1_B3) } in
            case GHC.Prim.logDouble#
                   (GHC.Prim.*##
                      6.283185307179586 (GHC.Prim.*## ipv1_X2S8 ipv1_X2S8))
            of wild18_X3Gn { __DEFAULT ->
            case GHC.Prim.indexDoubleArray#
                   rb2_a4rT (GHC.Prim.+# rb_a4rR ww1_X4OW)
            of wild19_X4AY { __DEFAULT ->
            case GHC.Prim./##
                   (GHC.Prim.-##
                      wild19_X4AY
                      (GHC.Prim.+##
                         wild17_X4xM
                         (GHC.Prim.*##
                            (GHC.Prim.+##
                               x1_X3GA (GHC.Prim.*## x2_X3cb wild17_X4xM))
                            tpl1_B3)))
                   ipv1_X2S8
            of wild20_X3x8 { __DEFAULT ->
            $wloop_s4Li
              (GHC.Prim.+##
                 ww_X4OR
                 (GHC.Prim.-##
                    (GHC.Prim.negateDouble# (GHC.Prim.*## 0.5 wild18_X3Gn))
                    (GHC.Prim.*##
                       0.5 (GHC.Prim.*## wild20_X3x8 wild20_X3x8))))
              (GHC.Prim.+# ww1_X4OW 1)
              ww2_X4P1
            }
            }
            }
            };
          GHC.Types.True ->
            $wloop_s4Li
              (GHC.Prim.+## ww_X4OR -1.0e50)
              (GHC.Prim.+# ww1_X4OW 1)
              ww2_X4P1
        }
    }; }
(Yes, of course, since you must ask, I am spending waaay too much time on premature optimization...)

Here is the current version with NOINLINE:
import qualified Data.Vector.Storable as SV

newtype Timedelta = Timedelta Double

cklsLogDens :: SV.Vector Double -> Timedelta -> Double -> Double -> Double
{-# NOINLINE cklsLogDens #-}
cklsLogDens p (Timedelta dt) x0 x1 = si `seq` (if si <= 0 then -1e50 else (sd `seq` (c - 0.5*((x1-mu)/sd)^2)))
  where
    al  = p `SV.unsafeIndex` 0
    be  = p `SV.unsafeIndex` 1
    si  = p `SV.unsafeIndex` 2
    xi  = p `SV.unsafeIndex` 3
    sdt = sqrt dt
    mu  = x0 + (al + be*x0)*dt
    sd  = si * (x0 ** xi) * sdt
    c   = sd `seq` (-0.5 * log (2*pi*sd^2))

main = putStrLn . show $ cklsLogDens SV.empty (Timedelta 0.1) 0.1 0.15
Corresponding Core snippet:
Main.cklsLogDens [InlPrag=NOINLINE]
  :: Data.Vector.Storable.Vector GHC.Types.Double
     -> Main.Timedelta
     -> GHC.Types.Double
     -> GHC.Types.Double
     -> GHC.Types.Double
[GblId, Arity=4, Caf=NoCafRefs, Str=DmdType U(ALL)LLL]
Main.cklsLogDens =
  \ (p_atw :: Data.Vector.Storable.Vector GHC.Types.Double)
    (ds_dVa :: Main.Timedelta)
    (x0_aty :: GHC.Types.Double)
    (x1_atz :: GHC.Types.Double) ->
    case p_atw
    of _ { Data.Vector.Storable.Vector rb_a2ml rb1_a2mm rb2_a2mn ->
    case GHC.Prim.readDoubleOffAddr#
           @ GHC.Prim.RealWorld rb1_a2mm 2 GHC.Prim.realWorld#
    of _ { (# s2_a2nH, x_a2nI #) ->
    case GHC.Prim.touch#
           @ GHC.ForeignPtr.ForeignPtrContents rb2_a2mn s2_a2nH
    of _ { __DEFAULT ->
    case GHC.Prim.<=## x_a2nI 0.0 of _ {
      GHC.Types.False ->
        case x0_aty of _ { GHC.Types.D# x2_a13d ->
        case GHC.Prim.readDoubleOffAddr#
               @ GHC.Prim.RealWorld rb1_a2mm 3 GHC.Prim.realWorld#
        of _ { (# s1_X2oB, x3_X2oD #) ->
        case GHC.Prim.touch#
               @ GHC.ForeignPtr.ForeignPtrContents rb2_a2mn s1_X2oB
        of _ { __DEFAULT ->
        case ds_dVa
             `cast` (Main.NTCo:Timedelta :: Main.Timedelta ~# GHC.Types.Double)
        of _ { GHC.Types.D# x4_a13m ->
        let {
   --- ^^^^ want to get rid of this!
   ---
          ipv_sYP [Dmd=Just L] :: GHC.Prim.Double#
          [LclId, Str=DmdType]
          ipv_sYP =
            GHC.Prim.*##
              (GHC.Prim.*## x_a2nI (GHC.Prim.**## x2_a13d x3_X2oD))
              (GHC.Prim.sqrtDouble# x4_a13m) } in
        case x1_atz of _ { GHC.Types.D# x5_X14E ->
        case GHC.Prim.readDoubleOffAddr#
               @ GHC.Prim.RealWorld rb1_a2mm 0 GHC.Prim.realWorld#
        of _ { (# s3_X2p2, x6_X2p4 #) ->
        case GHC.Prim.touch#
               @ GHC.ForeignPtr.ForeignPtrContents rb2_a2mn s3_X2p2
        of _ { __DEFAULT ->
        case GHC.Prim.readDoubleOffAddr#
               @ GHC.Prim.RealWorld rb1_a2mm 1 GHC.Prim.realWorld#
        of _ { (# s4_X2pi, x7_X2pk #) ->
        case GHC.Prim.touch#
               @ GHC.ForeignPtr.ForeignPtrContents rb2_a2mn s4_X2pi
        of _ { __DEFAULT ->
        case GHC.Prim.logDouble#
               (GHC.Prim.*## 6.283185307179586 (GHC.Prim.*## ipv_sYP ipv_sYP))
        of wild9_a13D { __DEFAULT ->
        case GHC.Prim./##
               (GHC.Prim.-##
                  x5_X14E
                  (GHC.Prim.+##
                     x2_a13d
                     (GHC.Prim.*##
                        (GHC.Prim.+## x6_X2p4 (GHC.Prim.*## x7_X2pk x2_a13d)) x4_a13m)))
               ipv_sYP
        of wild10_a13O { __DEFAULT ->
        GHC.Types.D#
          (GHC.Prim.-##
             (GHC.Prim.negateDouble# (GHC.Prim.*## 0.5 wild9_a13D))
             (GHC.Prim.*## 0.5 (GHC.Prim.*## wild10_a13O wild10_a13O)))
        }
        }
        }
        }
        }
        }
        }
        }
        }
        }
        };
      GHC.Types.True -> lvl_r2v7
    }
    }
    }
    }

Solution
Using ghc-7.6.1, I get no difference between -O and -O2, and neither seqs nor bang-patterns make a difference. The let remains in the Core.
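The answer does not show the bang-pattern variant that was tried, so the following is only a sketch of what such an attempt might look like. The name `logDensBang` and the tuple of parameters (standing in for the storable vector) are assumptions made to keep the example dependency-free.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical variant: the parameters (al, be, si, xi) arrive as a tuple
-- instead of a Data.Vector.Storable vector.
logDensBang :: (Double, Double, Double, Double) -> Double -> Double -> Double -> Double
logDensBang (al, be, !si, xi) dt x0 x1
  | si <= 0   = -1e50
  | otherwise =
      let !sd = si * (x0 ** xi) * sqrt dt   -- forced before the body runs
          !mu = x0 + (al + be * x0) * dt    -- forced before the body runs
          c   = -0.5 * log (2 * pi * sd * sd)
      in c - 0.5 * ((x1 - mu) / sd) ^ (2 :: Int)
```

Since GHC's strictness analysis already unboxes `sd` in the original, the bangs add no information the optimizer lacks, which is consistent with the Core staying the same.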

But I doubt that the let is really harmful: it binds a primitive (unboxed) value, not a boxed one, and that value is used in three places thereafter. Besides, in the produced assembly, I can find no hint of a lazy thunk (but since my knowledge of assembly is rather limited, don't take this as gospel).
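To illustrate why a let of type `Double#` need not mean a thunk, here is a small sketch written directly against GHC's primitive operations. The function `logTerm` is hypothetical; the primops come from `GHC.Exts`. In source Haskell a binding of unlifted type is evaluated strictly, and as far as I understand GHC only keeps such a let in Core when the right-hand side is cheap and cannot fail, so nothing is allocated on the heap for it.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts
  ( Double (D#), Double#, logDouble#, negateDouble#, sqrtDouble#
  , (*##), (-##), (/##) )

-- Hypothetical example: one unboxed intermediate value used in three places.
logTerm :: Double -> Double -> Double -> Double
logTerm (D# si) (D# dt) (D# r) =
  let sd :: Double#
      sd = si *## sqrtDouble# dt   -- unlifted binding: strict, never a thunk
  in D# (negateDouble# (logDouble# (sd *## sd)) -## ((r /## sd) *## (r /## sd)))
```

The binding `sd` plays the same role as `ipv_sYP` in the Core above: a raw double that is computed once and then read three times.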

I can get rid of the let by introducing a case branch:
cklsLogDens p (Timedelta dt) x0 x1
    = case p `SV.unsafeIndex` 2 of
        si | si <= 0   -> -1e50
           | otherwise ->
                let al  = p `SV.unsafeIndex` 0
                    be  = p `SV.unsafeIndex` 1
                    xi  = p `SV.unsafeIndex` 3
                    sdt = sqrt dt
                    mu  = x0 + (al + be*x0)*dt
                in case si*(x0**xi)*sdt of
                     0   -> 0
                     sd -> -0.5*log (2*pi*sd^2) - 0.5*((x1-mu)/sd)^2
This version produces only cases in the Core. Since sd should never be 0, even a mediocre branch predictor should make that branch essentially free in a loop.

However, I doubt whether that would actually improve performance. The comparison to 0 costs a register, whereas the assembly produced by the original needs less indirect addressing and can keep more values in registers when they are needed.
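Before swapping one formulation for the other, it may be worth checking that they agree. The sketch below places the where-based and case-based versions side by side. The `Params` record is a hypothetical replacement for the storable vector, and `dt` is passed as a plain `Double`, so this is an illustration rather than the original code.

```haskell
-- Hypothetical parameter record standing in for the storable vector.
data Params = Params { al, be, si, xi :: !Double }

-- Where-based formulation, as in the question.
logDensWhere :: Params -> Double -> Double -> Double -> Double
logDensWhere p dt x0 x1 =
  if si p <= 0 then -1e50 else c - 0.5 * ((x1 - mu) / sd) ^ (2 :: Int)
  where
    mu = x0 + (al p + be p * x0) * dt
    sd = si p * (x0 ** xi p) * sqrt dt
    c  = -0.5 * log (2 * pi * sd ^ (2 :: Int))

-- Case-based formulation, as in the answer.
logDensCase :: Params -> Double -> Double -> Double -> Double
logDensCase p dt x0 x1
  | si p <= 0 = -1e50
  | otherwise =
      let mu = x0 + (al p + be p * x0) * dt
      in case si p * (x0 ** xi p) * sqrt dt of
           0  -> 0
           sd -> -0.5 * log (2 * pi * sd ^ (2 :: Int))
                   - 0.5 * ((x1 - mu) / sd) ^ (2 :: Int)
```

The two differ only when `sd` is exactly 0: the case version returns 0, while the where version evaluates `log 0` and a division by zero. Whether either one is faster is a question for a benchmark on the real loop.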
