GHC 7.10 generates slower code than older versions


Question

I noticed that the latest version of GHC (7.10.3) produces significantly slower code than older versions. My current version:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3

I also have two older versions installed on my local machine.

My test code is taken from here (the collatz1.hs code):

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if even a then a else 3*a+1) `div` 2

-- new code
collatzLen :: Word32 -> Int
collatzLen a0 = lenIterWhile collatzNext (/= 1) a0

lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0 where
    len n m = if notDone n
                then len (next n) (m+1)
                else m
-- End of new code

main = do
    [a0] <- getArgs
    let max_a0 = (read a0)::Word32
    print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]

Compiling with GHC 7.4, 7.6 and 7.10 yields the following times:

$ ~/Tools/ghc-7.4.2/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.879s
user    0m1.876s
sys     0m0.000s

$ ~/Tools/ghc-7.6.1/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m1.901s
user    0m1.896s
sys     0m0.000s

$ ~/Tools/ghc/bin/ghc -O2 Test.hs 
[1 of 1] Compiling Main             ( Test.hs, Test.o )
Linking Test ...

$ time ./Test 1000000
(329,837799)

real    0m10.562s
user    0m10.528s
sys     0m0.036s    

Clearly, the latest version of GHC produces worse code here than the two older versions. I can't reproduce the efficiency reported in the blog, probably because I don't have LLVM and I don't have the exact version the author used. Still, I believe the conclusion is obvious.

My question is, in general, why could this happen? Somehow GHC has become worse than it used to be. And specifically, if I want to investigate, how should I get started?

Answer

Here's a comparison of both profiles (diff Test-GHC-7-8-4.prof Test-GHC-7-10-3.prof):

1c1                               
<       Fri Mar 11 19:58 2016 Time and Allocation Profiling Report  (Final)
---                               
>       Fri Mar 11 19:59 2016 Time and Allocation Profiling Report  (Final)
5,6c5,6                               
<       total time  =        2.40 secs   (2400 ticks @ 1000 us, 1 processor)
<       total alloc = 256,066,744 bytes  (excludes profiling overheads)
---                               
>       total time  =       10.89 secs   (10895 ticks @ 1000 us, 1 processor)
>       total alloc = 15,713,590,808 bytes  (excludes profiling overheads)
10,13c10,12                               
< lenIterWhile.len Main     93.8   0.0                    
< collatzMax       Main      2.2   93.7
< collatzNext      Main      2.0    0.0
< lenIterWhile     Main      1.5    6.2
---                                
> collatzNext      Main     79.6   89.4
> lenIterWhile.len Main     18.9    8.8
> collatzMax       Main      0.8    1.5

Something very strange is going on. While lenIterWhile.len took most of the time under GHC 7.8.4, collatzNext is now the culprit, and total allocation has grown from about 256 MB to about 15.7 GB. Let's have a look at the dumped core:

-- GHC 7.8.4
Rec {
Main.$wlen [Occ=LoopBreaker]
  :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,U>]
Main.$wlen =
  \ (ww_s4Mn :: GHC.Prim.Word#) (ww1_s4Mr :: GHC.Prim.Int#) ->
    case ww_s4Mn of wild_XQ {
      __DEFAULT ->
        case GHC.Prim.remWord# wild_XQ (__word 2) of _ [Occ=Dead] {
          __DEFAULT ->
            Main.$wlen
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_XQ))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s4Mr 1);
          __word 0 ->
            Main.$wlen
              (GHC.Prim.quotWord# wild_XQ (__word 2)) (GHC.Prim.+# ww1_s4Mr 1)
        };
      __word 1 -> ww1_s4Mr
    }
end Rec }

Seems more or less reasonable. Now about GHC 7.10.3:

Rec {
$wlen_r6Sy :: GHC.Prim.Word# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Str=DmdType <S,U><L,U>]
$wlen_r6Sy =
  \ (ww_s60s :: GHC.Prim.Word#) (ww1_s60w :: GHC.Prim.Int#) ->
    case ww_s60s of wild_X1Z {
      __DEFAULT ->
        case even
               @ Word32 GHC.Word.$fIntegralWord32 (GHC.Word.W32# wild_X1Z)
        of _ [Occ=Dead] {
          False ->
            $wlen_r6Sy
              (GHC.Prim.quotWord#
                 (GHC.Prim.narrow32Word#
                    (GHC.Prim.plusWord#
                       (GHC.Prim.narrow32Word# (GHC.Prim.timesWord# (__word 3) wild_X1Z))
                       (__word 1)))
                 (__word 2))
              (GHC.Prim.+# ww1_s60w 1);
          True ->
            $wlen_r6Sy
              (GHC.Prim.quotWord# wild_X1Z (__word 2)) (GHC.Prim.+# ww1_s60w 1)
        };
      __word 1 -> ww1_s60w
    }
end Rec }

All right, it looks the same, except for the call to even. Let's replace even with an expression that does get inlined, e.g. x `rem` 2 == 0:

import Data.Word
import Data.List
import System.Environment

collatzNext :: Word32 -> Word32
collatzNext a = (if a `rem` 2 == 0 then a else 3*a+1) `div` 2

-- rest of code the same

Let's compile it again with profiling and check:

$ stack --resolver=ghc-7.10 ghc -- Test.hs -O2 -fforce-recomp -prof -fprof-auto -auto-all
$ ./Test +RTS -s -p -RTS 1000000
(329,837799)
     416,119,240 bytes allocated in the heap
          69,760 bytes copied during GC
          59,368 bytes maximum residency (2 sample(s))
          21,912 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       800 colls,     0 par    0.000s   0.002s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0002s    0.0003s

  INIT    time    0.000s  (  0.019s elapsed)
  MUT     time    2.500s  (  2.546s elapsed)
  GC      time    0.000s  (  0.003s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    2.500s  (  2.567s elapsed)

  %GC     time       0.0%  (0.1% elapsed)

  Alloc rate    166,447,696 bytes per MUT second

  Productivity 100.0% of total user, 97.4% of total elapsed

$ cat Test.prof
        Fri Mar 11 20:22 2016 Time and Allocation Profiling Report  (Final)

           Test.exe +RTS -s -p -RTS 1000000

        total time  =        2.54 secs   (2535 ticks @ 1000 us, 1 processor)
        total alloc = 256,066,984 bytes  (excludes profiling overheads)

COST CENTRE      MODULE  %time %alloc

lenIterWhile.len Main     94.4    0.0
main             Main      1.9   93.7
collatzNext      Main      1.8    0.0
lenIterWhile     Main      1.3    6.2

                                                                   individual     inherited
COST CENTRE           MODULE                     no.     entries  %time %alloc   %time %alloc

MAIN                  MAIN                        44           0    0.0    0.0   100.0  100.0
 main                 Main                        89           0    1.9   93.7   100.0  100.0
  main.\              Main                        92     1000000    0.4    0.0    98.1    6.2
   collatzLen         Main                        93     1000000    0.2    0.0    97.8    6.2
    lenIterWhile      Main                        94     1000000    1.3    6.2    97.5    6.2
     lenIterWhile.len Main                        95    88826840   94.4    0.0    96.2    0.0
      collatzNext     Main                        96    87826840    1.8    0.0     1.8    0.0
  main.max_a0         Main                        90           1    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding.CodePage    73           0    0.0    0.0     0.0    0.0
 CAF                  System.Environment          64           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Handle.Text          62           0    0.0    0.0     0.0    0.0
 CAF                  GHC.IO.Encoding             61           0    0.0    0.0     0.0    0.0

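That seems to have fixed it. So the problem is that GHC 7.8 inlines even, while GHC 7.10 doesn't. This is due to the added {-# SPECIALISE even :: x -> Bool #-} pragmas for Int and Integer, which prevent even from being inlined, so the Word32 call here goes through the generic, dictionary-passing even (visible in the 7.10.3 core above).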
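The `rem` workaround can also be wrapped in a small helper so that the call site stays readable. The following is only a minimal sketch: the name isEvenW32 is hypothetical (it is not in the original program), and the INLINE pragma is an assumption about how to make sure the check ends up inside the loop. The remaining definitions are copied from the question.

```haskell
import Data.Word (Word32)

-- Hypothetical helper (not in the original program): a monomorphic parity
-- test on Word32 that avoids the class-polymorphic 'even'.
isEvenW32 :: Word32 -> Bool
isEvenW32 a = a `rem` 2 == 0
{-# INLINE isEvenW32 #-}

-- Same step function as in the question, with the helper in place of 'even'.
collatzNext :: Word32 -> Word32
collatzNext a = (if isEvenW32 a then a else 3*a+1) `div` 2

-- Unchanged from the question.
collatzLen :: Word32 -> Int
collatzLen a0 = lenIterWhile collatzNext (/= 1) a0

lenIterWhile :: (a -> a) -> (a -> Bool) -> a -> Int
lenIterWhile next notDone start = len start 0 where
    len n m = if notDone n
                then len (next n) (m+1)
                else m
```

Loaded in GHCi, collatzLen 837799 should evaluate to 329, the first component of the program output shown above. Whether the helper really gets inlined is best confirmed by dumping the core again.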

As the discussion on the corresponding issue documents, marking even and odd as {-# INLINEABLE ... #-} would resolve this. Note that the specialisation itself was added for performance reasons.
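To illustrate what such a pragma looks like in user code, here is a minimal sketch of a class-polymorphic parity test marked INLINABLE (GHC accepts both the INLINABLE and INLINEABLE spellings). The name evenInl is hypothetical, and this is not the definition from base; it only shows where the pragma goes. The idea is that INLINABLE keeps the function's unfolding available, so GHC can specialise it at the call site's type, here Word32.

```haskell
import Data.Word (Word32)

-- Hypothetical stand-in for 'even' (not the definition from base).
-- INLINABLE makes the unfolding available to call sites, which lets GHC
-- specialise the function at the types it is actually used with.
evenInl :: Integral a => a -> Bool
evenInl n = n `rem` 2 == 0
{-# INLINABLE evenInl #-}

-- The step function from the question, using the stand-in at Word32.
collatzNext :: Word32 -> Word32
collatzNext a = (if evenInl a then a else 3*a+1) `div` 2
```

Unlike a monomorphic helper, this keeps the function usable at every Integral type. Checking the generated core is still the only way to be sure the dictionary-passing call has disappeared.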
