What optimizations can GHC be expected to perform reliably?



      GHC has a lot of optimizations that it can perform, but I don't know what they all are, nor how likely they are to be performed and under what circumstances.

      My question is: what transformations can I expect it to apply every time, or nearly so? If I look at a piece of code that's going to be executed (evaluated) frequently and my first thought is "hmm, maybe I should optimize that", in which cases should my second thought be, "don't even think about it, GHC got this"?

      I was reading the paper Stream Fusion: From Lists to Streams to Nothing at All, and the technique they used of rewriting list processing into a different form which GHC's normal optimizations would then reliably optimize down into simple loops was novel to me. How can I tell when my own programs are eligible for that kind of optimization?

      There's some information in the GHC manual, but it only goes part of the way towards answering the question.

      EDIT: I'm starting a bounty. What I would like is a list of lower-level transformations like lambda/let/case-floating, type/constructor/function argument specialization, strictness analysis and unboxing, worker/wrapper, and whatever else significant GHC does that I've left out, along with explanations and examples of input and output code, and ideally illustrations of situations when the total effect is more than the sum of its parts. And ideally some mention of when transformations won't happen. I'm not expecting novel-length explanations of every transformation, a couple of sentences and inline one-liner code examples could be enough (or a link, if it's not to twenty pages of scientific paper), as long as the big picture is clear by the end of it. I want to be able to look at a piece of code and be able to make a good guess about whether it will compile down to a tight loop, or why not, or what I would have to change to make it. (I'm not interested so much here in the big optimization frameworks like stream fusion (I just read a paper about that); more in the kind of knowledge that people who write these frameworks have.)

      Solution

      This GHC Trac page also explains the passes fairly well. This page explains the optimization ordering, though, like the majority of the Trac Wiki, it is out of date.

      For specifics, the best thing to do is probably to look at how a specific program is compiled. The best way to see which optimizations are being performed is to compile the program verbosely, using the -v flag. Taking as an example the first piece of Haskell I could find on my computer:

      Glasgow Haskell Compiler, Version 7.4.2, stage 2 booted by GHC version 7.4.1
      Using binary package database: /usr/lib/ghc-7.4.2/package.conf.d/package.cache
      wired-in package ghc-prim mapped to ghc-prim-0.2.0.0-7d3c2c69a5e8257a04b2c679c40e2fa7
      wired-in package integer-gmp mapped to integer-gmp-0.4.0.0-af3a28fdc4138858e0c7c5ecc2a64f43
      wired-in package base mapped to base-4.5.1.0-6e4c9bdc36eeb9121f27ccbbcb62e3f3
      wired-in package rts mapped to builtin_rts
      wired-in package template-haskell mapped to template-haskell-2.7.0.0-2bd128e15c2d50997ec26a1eaf8b23bf
      wired-in package dph-seq not found.
      wired-in package dph-par not found.
      Hsc static flags: -static
      *** Chasing dependencies:
      Chasing modules from: *SleepSort.hs
      Stable obj: [Main]
      Stable BCO: []
      Ready for upsweep
        [NONREC
            ModSummary {
               ms_hs_date = Tue Oct 18 22:22:11 CDT 2011
               ms_mod = main:Main,
               ms_textual_imps = [import (implicit) Prelude, import Control.Monad,
                                  import Control.Concurrent, import System.Environment]
               ms_srcimps = []
            }]
      *** Deleting temp files:
      Deleting: 
      compile: input file SleepSort.hs
      Created temporary directory: /tmp/ghc4784_0
      *** Checking old interface for main:Main:
      [1 of 1] Compiling Main             ( SleepSort.hs, SleepSort.o )
      *** Parser:
      *** Renamer/typechecker:
      *** Desugar:
      Result size of Desugar (after optimization) = 79
      *** Simplifier:
      Result size of Simplifier iteration=1 = 87
      Result size of Simplifier iteration=2 = 93
      Result size of Simplifier iteration=3 = 83
      Result size of Simplifier = 83
      *** Specialise:
      Result size of Specialise = 83
      *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
      Result size of Float out(FOS {Lam = Just 0,
                                    Consts = True,
                                    PAPs = False}) = 95
      *** Float inwards:
      Result size of Float inwards = 95
      *** Simplifier:
      Result size of Simplifier iteration=1 = 253
      Result size of Simplifier iteration=2 = 229
      Result size of Simplifier = 229
      *** Simplifier:
      Result size of Simplifier iteration=1 = 218
      Result size of Simplifier = 218
      *** Simplifier:
      Result size of Simplifier iteration=1 = 283
      Result size of Simplifier iteration=2 = 226
      Result size of Simplifier iteration=3 = 202
      Result size of Simplifier = 202
      *** Demand analysis:
      Result size of Demand analysis = 202
      *** Worker Wrapper binds:
      Result size of Worker Wrapper binds = 202
      *** Simplifier:
      Result size of Simplifier = 202
      *** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
      Result size of Float out(FOS {Lam = Just 0,
                                    Consts = True,
                                    PAPs = True}) = 210
      *** Common sub-expression:
      Result size of Common sub-expression = 210
      *** Float inwards:
      Result size of Float inwards = 210
      *** Liberate case:
      Result size of Liberate case = 210
      *** Simplifier:
      Result size of Simplifier iteration=1 = 206
      Result size of Simplifier = 206
      *** SpecConstr:
      Result size of SpecConstr = 206
      *** Simplifier:
      Result size of Simplifier = 206
      *** Tidy Core:
      Result size of Tidy Core = 206
      writeBinIface: 4 Names
      writeBinIface: 28 dict entries
      *** CorePrep:
      Result size of CorePrep = 224
      *** Stg2Stg:
      *** CodeGen:
      *** CodeOutput:
      *** Assembler:
      '/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-I.' '-c' '/tmp/ghc4784_0/ghc4784_0.s' '-o' 'SleepSort.o'
      Upsweep completely successful.
      *** Deleting temp files:
      Deleting: /tmp/ghc4784_0/ghc4784_0.c /tmp/ghc4784_0/ghc4784_0.s
      Warning: deleting non-existent /tmp/ghc4784_0/ghc4784_0.c
      link: linkables are ...
      LinkableM (Sat Sep 29 20:21:02 CDT 2012) main:Main
         [DotO SleepSort.o]
      Linking SleepSort ...
      *** C Compiler:
      '/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-c' '/tmp/ghc4784_0/ghc4784_0.c' '-o' '/tmp/ghc4784_0/ghc4784_0.o' '-DTABLES_NEXT_TO_CODE' '-I/usr/lib/ghc-7.4.2/include'
      *** C Compiler:
      '/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-c' '/tmp/ghc4784_0/ghc4784_0.s' '-o' '/tmp/ghc4784_0/ghc4784_1.o' '-DTABLES_NEXT_TO_CODE' '-I/usr/lib/ghc-7.4.2/include'
      *** Linker:
      '/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-o' 'SleepSort' 'SleepSort.o' '-L/usr/lib/ghc-7.4.2/base-4.5.1.0' '-L/usr/lib/ghc-7.4.2/integer-gmp-0.4.0.0' '-L/usr/lib/ghc-7.4.2/ghc-prim-0.2.0.0' '-L/usr/lib/ghc-7.4.2' '/tmp/ghc4784_0/ghc4784_0.o' '/tmp/ghc4784_0/ghc4784_1.o' '-lHSbase-4.5.1.0' '-lHSinteger-gmp-0.4.0.0' '-lgmp' '-lHSghc-prim-0.2.0.0' '-lHSrts' '-lm' '-lrt' '-ldl' '-u' 'ghczmprim_GHCziTypes_Izh_static_info' '-u' 'ghczmprim_GHCziTypes_Czh_static_info' '-u' 'ghczmprim_GHCziTypes_Fzh_static_info' '-u' 'ghczmprim_GHCziTypes_Dzh_static_info' '-u' 'base_GHCziPtr_Ptr_static_info' '-u' 'base_GHCziWord_Wzh_static_info' '-u' 'base_GHCziInt_I8zh_static_info' '-u' 'base_GHCziInt_I16zh_static_info' '-u' 'base_GHCziInt_I32zh_static_info' '-u' 'base_GHCziInt_I64zh_static_info' '-u' 'base_GHCziWord_W8zh_static_info' '-u' 'base_GHCziWord_W16zh_static_info' '-u' 'base_GHCziWord_W32zh_static_info' '-u' 'base_GHCziWord_W64zh_static_info' '-u' 'base_GHCziStable_StablePtr_static_info' '-u' 'ghczmprim_GHCziTypes_Izh_con_info' '-u' 'ghczmprim_GHCziTypes_Czh_con_info' '-u' 'ghczmprim_GHCziTypes_Fzh_con_info' '-u' 'ghczmprim_GHCziTypes_Dzh_con_info' '-u' 'base_GHCziPtr_Ptr_con_info' '-u' 'base_GHCziPtr_FunPtr_con_info' '-u' 'base_GHCziStable_StablePtr_con_info' '-u' 'ghczmprim_GHCziTypes_False_closure' '-u' 'ghczmprim_GHCziTypes_True_closure' '-u' 'base_GHCziPack_unpackCString_closure' '-u' 'base_GHCziIOziException_stackOverflow_closure' '-u' 'base_GHCziIOziException_heapOverflow_closure' '-u' 'base_ControlziExceptionziBase_nonTermination_closure' '-u' 'base_GHCziIOziException_blockedIndefinitelyOnMVar_closure' '-u' 'base_GHCziIOziException_blockedIndefinitelyOnSTM_closure' '-u' 'base_ControlziExceptionziBase_nestedAtomically_closure' '-u' 'base_GHCziWeak_runFinalizzerBatch_closure' '-u' 'base_GHCziTopHandler_flushStdHandles_closure' '-u' 'base_GHCziTopHandler_runIO_closure' '-u' 'base_GHCziTopHandler_runNonIO_closure' '-u' 
'base_GHCziConcziIO_ensureIOManagerIsRunning_closure' '-u' 'base_GHCziConcziSync_runSparks_closure' '-u' 'base_GHCziConcziSignal_runHandlers_closure'
      link: done
      *** Deleting temp files:
      Deleting: /tmp/ghc4784_0/ghc4784_1.o /tmp/ghc4784_0/ghc4784_0.s /tmp/ghc4784_0/ghc4784_0.o /tmp/ghc4784_0/ghc4784_0.c
      *** Deleting temp dirs:
      Deleting: /tmp/ghc4784_0
      

      Looking from the first *** Simplifier: to the last, where all the optimization phases happen, we see quite a lot.

      First of all, the Simplifier runs between almost all of the other phases. This makes writing many passes much easier: when implementing an optimization, its author can often just emit simple rewrites and let the Simplifier propagate the changes instead of doing it manually. The Simplifier itself encompasses a number of simple optimizations, including inlining and fusion. The main limitations of this that I know of are that GHC refuses to inline recursive functions, and that things have to be named correctly for fusion to work.
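      As a concrete illustration of such rewrites, here is a minimal sketch. The "map/map" rule below mirrors the shape of the list-fusion rules GHC's base library actually ships; `double` is a made-up function for the example, and user-defined rules only fire when compiling with -O:

```haskell
module Main where

-- A rewrite rule of the kind the Simplifier applies: mapping twice
-- is the same as mapping once with the composed function. GHC's own
-- list-fusion rules in base have this same shape.
{-# RULES "map/map/example" forall f g xs. map f (map g xs) = map (f . g) xs #-}

-- INLINE makes the Simplifier willing to unfold this at call sites;
-- it would refuse to inline a recursive definition.
{-# INLINE double #-}
double :: Int -> Int
double x = x * 2

main :: IO ()
main = print (map double (map (+ 1) [1, 2, 3]))
```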

      Next, we see a full list of all the optimizations performed:

      • Specialise

        The basic idea of specialization is to remove polymorphism and overloading by identifying places where the function is called and creating versions of the function that aren't polymorphic - they are specific to the types they are called with. You can also tell the compiler to do this with the SPECIALISE pragma. As an example, take a factorial function:

        fac :: (Num a, Eq a) => a -> a
        fac 0 = 1
        fac n = n * fac (n - 1)
        

        As the compiler doesn't know any properties of the multiplication that is to be used, it cannot optimize this at all. If however, it sees that it is used on an Int, it now can create a new version, differing only in the type:

        fac_Int :: Int -> Int
        fac_Int 0 = 1
        fac_Int n = n * fac_Int (n - 1)
        

        Next, rules mentioned below can fire, and you end up with something working on unboxed Ints, which is much faster than the original. Another way to look at specialisation is partial application on type class dictionaries and type variables.

        The source here has a load of notes in it.
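        The pragma route looks like this (a minimal sketch; with optimization on, calls at Int then go through the generated monomorphic copy):

```haskell
module Main where

-- The overloaded factorial from above: every call pays for the
-- Num/Eq dictionary passing until it is specialised.
fac :: (Num a, Eq a) => a -> a
fac 0 = 1
fac n = n * fac (n - 1)

-- Ask GHC to generate the monomorphic version explicitly, rather
-- than relying on the optimizer seeing a call site at Int.
{-# SPECIALISE fac :: Int -> Int #-}

main :: IO ()
main = print (fac (10 :: Int))
```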

      • Float out

        EDIT: I apparently misunderstood this before. My explanation has completely changed.

        The basic idea of this is to move computations that shouldn't be repeated out of functions. For example, suppose we had this:

        \x -> let y = expensive in x+y
        

        In the above lambda, every time the function is called, y is recomputed. A better function, which floating out produces, is

        let y = expensive in \x -> x+y
        

        To facilitate the process, other transformations may be applied. For example, this happens:

         \x -> x + f 2
         \x -> x + let f_2 = f 2 in f_2
         \x -> let f_2 = f 2 in x + f_2
         let f_2 = f 2 in \x -> x + f_2
        

        Again, repeated computation is saved.

        The source is very readable in this case.

        At the moment bindings between two adjacent lambdas are not floated. For example, this does not happen:

        \x y -> let t = x+x in ...
        

        going to

         \x -> let t = x+x in \y -> ...
        

      • Float inwards

        Quoting the source code,

        The main purpose of floatInwards is floating into branches of a case, so that we don't allocate things, save them on the stack, and then discover that they aren't needed in the chosen branch.

        As an example, suppose we had this expression:

        let x = big in
            case v of
                True -> x + 1
                False -> 0
        

        If v evaluates to False, then by allocating x, which is presumably some big thunk, we have wasted time and space. Floating inwards fixes this, producing this:

        case v of
            True -> let x = big in x + 1
            False -> let x = big in 0
        

        , which is subsequently replaced by the simplifier with

        case v of
            True -> big + 1
            False -> 0
        

        This paper, although covering other topics, gives a fairly clear introduction. Note that despite their names, floating in and floating out don't get in an infinite loop for two reasons:

        1. Float in floats lets into case statements, while float out deals with functions.
        2. There is a fixed order of passes, so they shouldn't be alternating infinitely.

      • Demand analysis

        Demand analysis, or strictness analysis is less of a transformation and more, like the name suggests, of an information gathering pass. The compiler finds functions that always evaluate their arguments (or at least some of them), and passes those arguments using call-by-value, instead of call-by-need. Since you get to evade the overheads of thunks, this is often much faster. Many performance problems in Haskell arise from either this pass failing, or code simply not being strict enough. A simple example is the difference between using foldr, foldl, and foldl' to sum a list of integers - the first causes stack overflow, the second causes heap overflow, and the last runs fine, because of strictness. This is probably the easiest to understand and best documented of all of these. I believe that polymorphism and CPS code often defeat this.
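        A minimal sketch of the fold example above (sumLoop is a hypothetical hand-written equivalent; the bang patterns state the strictness that demand analysis would otherwise have to discover):

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import Data.List (foldl')

-- The strict left fold mentioned above: foldl' forces the
-- accumulator at each step, so no chain of thunks builds up.
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0

-- Hand-written equivalent: the bang pattern makes the accumulator's
-- strictness explicit instead of leaving it to the analyser.
sumLoop :: [Int] -> Int
sumLoop = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs

main :: IO ()
main = print (sumStrict [1 .. 1000], sumLoop [1 .. 1000])
```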

      • Worker Wrapper binds

        The basic idea of the worker/wrapper transformation is to do a tight loop on a simple structure, converting to and from that structure at the ends. For example, take this function, which calculates the factorial of a number.

        factorial :: Int -> Int
        factorial 0 = 1
        factorial n = n * factorial (n - 1)
        

        Using the definition of Int in GHC, we have

        factorial :: Int -> Int
        factorial (I# 0#) = I# 1#
        factorial (I# n#) = I# (n# *# case factorial (I# (n# -# 1#)) of
            I# down# -> down#)
        

        Notice how the code is covered in I#s? We can remove them by doing this:

        factorial :: Int -> Int
        factorial (I# n#) = I# (factorial# n#)
        
        factorial# :: Int# -> Int#
        factorial# 0# = 1#
        factorial# n# = n# *# factorial# (n# -# 1#)
        

        Although this specific example could have also been done by SpecConstr, the worker/wrapper transformation is very general in the things it can do.
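        The Core-style worker/wrapper pair above can also be written as actual compilable source with the MagicHash extension (a sketch; in practice GHC derives this split itself from the strictness of factorial):

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int (I#), Int#, (*#), (-#))

-- Wrapper: unboxes the argument, reboxes the result.
factorial :: Int -> Int
factorial (I# n#) = I# (factorial# n#)

-- Worker: a tight loop on raw Int# with no boxing.
factorial# :: Int# -> Int#
factorial# 0# = 1#
factorial# n# = n# *# factorial# (n# -# 1#)

main :: IO ()
main = print (factorial 10)
```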

      • Common sub-expression

        This is another really simple optimization that is very effective, like strictness analysis. The basic idea is that if you have two expressions that are the same, they will have the same value. For example, if fib is a Fibonacci number calculator, CSE will transform

        fib x + fib x
        

        into

        let fib_x = fib x in fib_x + fib_x
        

        which cuts the computation in half. Unfortunately, this can occasionally get in the way of other optimizations. Another problem is that the two expressions have to be in the same place and that they have to be syntactically the same, not the same by value. For example, CSE won't fire in the following code without a bunch of inlining:

        x = (1 + (2 + 3)) + ((1 + 2) + 3)
        y = f x
        z = g (f x) y
        

        However, if you compile via llvm, you may get some of this combined, due to its Global Value Numbering pass.

      • Liberate case

        This seems to be a terribly documented transformation, besides the fact that it can cause code explosion. Here is a reformatted (and slightly rewritten) version of the little documentation I found:

        This module walks over Core, and looks for case on free variables. The criterion is: if there is a case on a free variable on the route to the recursive call, then the recursive call is replaced with an unfolding. For example, in

        f = \ t -> case v of V a b -> a : f t
        

        the inner f is replaced, to make

        f = \ t -> case v of V a b -> a : (letrec f = \ t -> case v of V a b -> a : f t in f) t
        

        Note the need for shadowing. Simplifying, we get

        f = \ t -> case v of V a b -> a : (letrec f = \ t -> a : f t in f t)
        

        This is better code, because a is free inside the inner letrec, rather than needing projection from v. Note that this deals with free variables, unlike SpecConstr, which deals with arguments that are of known form.

        See below for more information about SpecConstr.

      • SpecConstr - this transforms programs like

        f (Left x) y = somethingComplicated1
        f (Right x) y = somethingComplicated2
        

        into

        f_Left x y = somethingComplicated1
        f_Right x y = somethingComplicated2
        
        {-# INLINE f #-}
        f (Left x) = f_Left x
        f (Right x) = f_Right x
        

        As an extended example, take this definition of last:

        last [] = error "last: empty list"
        last (x:[]) = x
        last (x:x2:xs) = last (x2:xs)
        

        We first transform it to

        last_nil = error "last: empty list"
        last_cons x [] = x
        last_cons x (x2:xs) = last (x2:xs)
        
        {-# INLINE last #-}
        last [] = last_nil
        last (x : xs) = last_cons x xs
        

        Next, the simplifier runs, and we have

        last_nil = error "last: empty list"
        last_cons x [] = x
        last_cons x (x2:xs) = last_cons x2 xs
        
        {-# INLINE last #-}
        last [] = last_nil
        last (x : xs) = last_cons x xs
        

        Note that the program is now faster, as we are not repeatedly boxing and unboxing the front of the list. Also note that the inlining is crucial, as it allows the new, more efficient definitions to actually be used, as well as making recursive definitions better.

        SpecConstr is controlled by a number of heuristics. The ones mentioned in the paper are as follows:

        1. The lambdas are explicit and the arity is a.
        2. The right hand side is "sufficiently small," something controlled by a flag.
        3. The function is recursive, and the specializable call is used in the right hand side.
        4. All of the arguments to the function are present.
        5. At least one of the arguments is a constructor application.
        6. That argument is case-analysed somewhere in the function.

        However, the heuristics have almost certainly changed. In fact, the paper mentions an alternative sixth heuristic:

        Specialise on an argument x only if x is only scrutinised by a case, and is not passed to an ordinary function, or returned as part of the result.

      This was a very small file (12 lines) and so possibly didn't trigger that many optimizations (though I think it did them all). This also doesn't tell you why it picked those passes and why it put them in that order.
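      To do the same kind of inspection on your own code, the Core dump flags are more targeted than -v (flag names as of GHC 7.x, though later versions keep them): -ddump-simpl prints the final Core, -ddump-simpl-stats summarises which rules and transformations fired, and -dverbose-core2core prints the Core after every pass listed above. A small candidate module to try them on:

```haskell
-- Compile with, e.g.:  ghc -O2 -ddump-simpl Main.hs
-- With -O2 this should fuse and unbox down to a tight loop over Int#,
-- which is visible in the dumped Core.
module Main where

sumDoubles :: Int -> Int
sumDoubles n = sum (map (* 2) [1 .. n])

main :: IO ()
main = print (sumDoubles 100)
```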
