Difference in performance of compiled accelerate code run from ghci and shell


    Problem

    Hello, I'm using the accelerate library to create an application that lets the user interactively call functions that process images, which is why I'm building on and extending ghci via the GHC API.

    The problem is that when the compiled executable is run from the shell the computations finish in under 100 ms (slightly less than 80), while the same compiled code run within ghci takes over 100 ms (on average a bit more than 140) to finish.

    Resources

    Sample code + execution logs: https://gist.github.com/zgredzik/15a437c87d3d8d03b8fc

    Description

    First of all: the tests were run after the CUDA kernel had been compiled (the compilation itself added an additional 2 seconds, but that is not the issue here).

    When running the compiled executable from the shell the computations are done in under 10 ms. (The first shell run and the second shell run were passed different arguments to make sure the data wasn't cached anywhere.)

    When trying to run the same code from ghci and fiddling with the input data, the computations take over 100 ms. I understand that interpreted code is slower than compiled code, but I'm loading the same compiled code within the ghci session and calling the same top-level binding (packedFunction). I've explicitly typed it to make sure it is specialised (with the same results as using the SPECIALIZE pragma).
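    A minimal sketch of what "explicitly typed" means here (packedFunction's real signature is not shown in the question, so the types below are assumptions):

    -- Hypothetical stand-in for the real packedFunction: a fully
    -- monomorphic top-level signature, so no leftover polymorphism can
    -- force ghci through an unspecialised code path.
    packedFunction :: Vector Double -> Scalar Double
    packedFunction = C.run1 $ A.maximum . A.map (+1)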

    However, the computations do take less than 10 ms if I run the main function in ghci (even when changing the input data with :set args between consecutive calls).
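    For reference, the ghci side of that experiment looks like this (the argument values here are placeholders of mine, not from the post):

    ghci> :set args image1.png
    ghci> main
    ghci> :set args image2.png
    ghci> main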

    Main.hs was compiled with ghc -o main Main.hs -O2 -dynamic -threaded

    I'm wondering where the overhead comes from. Does anyone have any suggestions as to why this is happening?


    A simplified version of the example posted by remdezx:

    {-# LANGUAGE OverloadedStrings #-}
    
    module Main where
    
    import Data.Array.Accelerate as A
    import Data.Array.Accelerate.CUDA as C
    import Data.Time.Clock       (diffUTCTime, getCurrentTime)
    
    main :: IO ()
    main = do
        start <- getCurrentTime
        -- copy a million-element vector to the GPU, add 1 to every element,
        -- reduce with maximum, and print the scalar result
        print $ C.run $ A.maximum $ A.map (+1) $ A.use (fromList (Z:.1000000) [1..1000000] :: Vector Double)
        end   <- getCurrentTime
        print $ diffUTCTime end start
    

    When I compile and execute it, it takes 0.09s to finish.

    $ ghc -O2 Main.hs -o main -threaded
    [1 of 1] Compiling Main             ( Main.hs, Main.o )
    Linking main ...
    $ ./main
    Array (Z) [1000001.0]
    0.092906s
    

    But when I precompile it and run it in the interpreter, it takes 0.25s

    $ ghc -O2 Main.hs -c -dynamic
    $ ghci Main
    ghci> main
    Array (Z) [1000001.0]
    0.258224s
    

    Solution

    I investigated accelerate and accelerate-cuda and added some debug code to measure times both under ghci and in a compiled, optimised version.
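    The instrumentation is not part of accelerate itself; it was a timing bracket of roughly the following shape, reconstructed here from the log format it produces:

    -- Sketch of the debug bracket behind the ">>>>> name" and
    -- "<<<<< name: x CPU  ys TOTAL" lines in the traces below.
    import System.CPUTime  (getCPUTime)
    import Data.Time.Clock (diffUTCTime, getCurrentTime)

    timed :: String -> IO a -> IO a
    timed name action = do
        putStrLn $ ">>>>> " ++ name
        cpu0  <- getCPUTime                  -- CPU time, in picoseconds
        wall0 <- getCurrentTime
        r     <- action
        cpu1  <- getCPUTime
        wall1 <- getCurrentTime
        let cpu = fromIntegral (cpu1 - cpu0) / 1e12 :: Double
        putStrLn $ "<<<<< " ++ name ++ ": " ++ show cpu ++ " CPU  "
                             ++ show (diffUTCTime wall1 wall0) ++ " TOTAL"
        return r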

    The results are below; you can see the stack traces and execution times.

    ghci run

    $ ghc -O2 -dynamic -c -threaded Main.hs && ghci 
    GHCi, version 7.8.3: http://www.haskell.org/ghc/  :? for help
    …
    Loading package ghc-prim ... linking ... done.
    Loading package integer-gmp ... linking ... done.
    Loading package base ... linking ... done.
    Ok, modules loaded: Main.
    Prelude Main> Loading package transformers-0.3.0.0 ... linking ... done.
    …
    Loading package array-0.5.0.0 ... linking ... done.
    (...)
    Loading package accelerate-cuda-0.15.0.0 ... linking ... done.
    >>>>> run
    >>>>> runAsyncIn.execute
    >>>>>  runAsyncIn.seq ctx
    <<<<<  runAsyncIn.seq ctx: 4.1609e-2 CPU  0.041493s TOTAL
    >>>>>  runAsyncIn.seq a
    <<<<<  runAsyncIn.seq a: 1.0e-6 CPU  0.000001s TOTAL
    >>>>>  runAsyncIn.seq acc
    >>>>>   convertAccWith True
    <<<<<   convertAccWith: 0.0 CPU  0.000017s TOTAL
    <<<<<  runAsyncIn.seq acc: 2.68e-4 CPU  0.000219s TOTAL
    >>>>>  evalCUDA
    >>>>>   push
    <<<<<   push: 0.0 CPU  0.000002s TOTAL
    >>>>>   evalStateT
    >>>>>    runAsyncIn.compileAcc
    >>>>>     compileOpenAcc
    >>>>>      compileOpenAcc.traveuseAcc.Alet
    >>>>>      compileOpenAcc.traveuseAcc.Use
    >>>>>       compileOpenAcc.traveuseAcc.use3
    >>>>>       compileOpenAcc.traveuseAcc.use1
    <<<<<       compileOpenAcc.traveuseAcc.use1: 0.0 CPU  0.000001s TOTAL
    >>>>>       compileOpenAcc.traveuseAcc.use2
    >>>>>        compileOpenAcc.traveuseAcc.seq arr
    <<<<<        compileOpenAcc.traveuseAcc.seq arr: 0.105716 CPU  0.105501s TOTAL
    >>>>>        useArrayAsync
    <<<<<        useArrayAsync: 1.234e-3 CPU  0.001505s TOTAL
    <<<<<       compileOpenAcc.traveuseAcc.use2: 0.108012 CPU  0.108015s TOTAL
    <<<<<       compileOpenAcc.traveuseAcc.use3: 0.108539 CPU  0.108663s TOTAL
    <<<<<      compileOpenAcc.traveuseAcc.Use: 0.109375 CPU  0.109005s TOTAL
    >>>>>      compileOpenAcc.traveuseAcc.Fold1
    >>>>>      compileOpenAcc.traveuseAcc.Avar
    <<<<<      compileOpenAcc.traveuseAcc.Avar: 0.0 CPU  0.000001s TOTAL
    >>>>>      compileOpenAcc.traveuseAcc.Avar
    <<<<<      compileOpenAcc.traveuseAcc.Avar: 0.0 CPU  0s TOTAL
    >>>>>      compileOpenAcc.traveuseAcc.Avar
    <<<<<      compileOpenAcc.traveuseAcc.Avar: 0.0 CPU  0.000001s TOTAL
    >>>>>      compileOpenAcc.traveuseAcc.Avar
    <<<<<      compileOpenAcc.traveuseAcc.Avar: 0.0 CPU  0s TOTAL
    <<<<<      compileOpenAcc.traveuseAcc.Fold1: 2.059e-3 CPU  0.002384s TOTAL
    <<<<<      compileOpenAcc.traveuseAcc.Alet: 0.111434 CPU  0.112034s TOTAL
    <<<<<     compileOpenAcc: 0.11197 CPU  0.112615s TOTAL
    <<<<<    runAsyncIn.compileAcc: 0.11197 CPU  0.112833s TOTAL
    >>>>>    runAsyncIn.dumpStats
    <<<<<    runAsyncIn.dumpStats: 2.0e-6 CPU  0.000001s TOTAL
    >>>>>    runAsyncIn.executeAcc
    >>>>>     executeAcc
    <<<<<     executeAcc: 8.96e-4 CPU  0.00049s TOTAL
    <<<<<    runAsyncIn.executeAcc: 9.36e-4 CPU  0.0007s TOTAL
    >>>>>    runAsyncIn.collect
    <<<<<    runAsyncIn.collect: 0.0 CPU  0.000027s TOTAL
    <<<<<   evalStateT: 0.114156 CPU  0.115327s TOTAL
    >>>>>   pop
    <<<<<   pop: 0.0 CPU  0.000002s TOTAL
    >>>>>   performGC
    <<<<<   performGC: 5.7246e-2 CPU  0.057814s TOTAL
    <<<<<  evalCUDA: 0.17295 CPU  0.173943s TOTAL
    <<<<< runAsyncIn.execute: 0.215475 CPU  0.216563s TOTAL
    <<<<< run: 0.215523 CPU  0.216771s TOTAL
    Array (Z) [1000001.0]
    0.217148s
    Prelude Main> Leaving GHCi.
    

    compiled code run

    $ ghc -O2 -threaded Main.hs && ./Main
    [1 of 1] Compiling Main             ( Main.hs, Main.o )
    Linking Main ...
    >>>>> run
    >>>>> runAsyncIn.execute
    >>>>>  runAsyncIn.seq ctx
    <<<<<  runAsyncIn.seq ctx: 4.0639e-2 CPU  0.041498s TOTAL
    >>>>>  runAsyncIn.seq a
    <<<<<  runAsyncIn.seq a: 1.0e-6 CPU  0.000001s TOTAL
    >>>>>  runAsyncIn.seq acc
    >>>>>   convertAccWith True
    <<<<<   convertAccWith: 1.2e-5 CPU  0.000005s TOTAL
    <<<<<  runAsyncIn.seq acc: 1.15e-4 CPU  0.000061s TOTAL
    >>>>>  evalCUDA
    >>>>>   push
    <<<<<   push: 2.0e-6 CPU  0.000002s TOTAL
    >>>>>   evalStateT
    >>>>>    runAsyncIn.compileAcc
    >>>>>     compileOpenAcc
    >>>>>      compileOpenAcc.traveuseAcc.Alet
    >>>>>      compileOpenAcc.traveuseAcc.Use
    >>>>>       compileOpenAcc.traveuseAcc.use3
    >>>>>       compileOpenAcc.traveuseAcc.use1
    <<<<<       compileOpenAcc.traveuseAcc.use1: 0.0 CPU  0.000001s TOTAL
    >>>>>       compileOpenAcc.traveuseAcc.use2
    >>>>>        compileOpenAcc.traveuseAcc.seq arr
    <<<<<        compileOpenAcc.traveuseAcc.seq arr: 3.6651e-2 CPU  0.03712s TOTAL
    >>>>>        useArrayAsync
    <<<<<        useArrayAsync: 1.427e-3 CPU  0.001427s TOTAL
    <<<<<       compileOpenAcc.traveuseAcc.use2: 3.8776e-2 CPU  0.039152s TOTAL
    <<<<<       compileOpenAcc.traveuseAcc.use3: 3.8794e-2 CPU  0.039207s TOTAL
    <<<<<      compileOpenAcc.traveuseAcc.Use: 3.8808e-2 CPU  0.03923s TOTAL
    >>>>>      compileOpenAcc.traveuseAcc.Fold1
    >>>>>      compileOpenAcc.traveuseAcc.Avar
    <<<<<      compileOpenAcc.traveuseAcc.Avar: 2.0e-6 CPU  0.000001s TOTAL
    >>>>>      compileOpenAcc.traveuseAcc.Avar
    <<<<<      compileOpenAcc.traveuseAcc.Avar: 2.0e-6 CPU  0.000001s TOTAL
    >>>>>      compileOpenAcc.traveuseAcc.Avar
    <<<<<      compileOpenAcc.traveuseAcc.Avar: 0.0 CPU  0.000001s TOTAL
    >>>>>      compileOpenAcc.traveuseAcc.Avar
    <<<<<      compileOpenAcc.traveuseAcc.Avar: 0.0 CPU  0.000001s TOTAL
    <<<<<      compileOpenAcc.traveuseAcc.Fold1: 1.342e-3 CPU  0.001284s TOTAL
    <<<<<      compileOpenAcc.traveuseAcc.Alet: 4.0197e-2 CPU  0.040578s TOTAL
    <<<<<     compileOpenAcc: 4.0248e-2 CPU  0.040895s TOTAL
    <<<<<    runAsyncIn.compileAcc: 4.0834e-2 CPU  0.04103s TOTAL
    >>>>>    runAsyncIn.dumpStats
    <<<<<    runAsyncIn.dumpStats: 0.0 CPU  0s TOTAL
    >>>>>    runAsyncIn.executeAcc
    >>>>>     executeAcc
    <<<<<     executeAcc: 2.87e-4 CPU  0.000403s TOTAL
    <<<<<    runAsyncIn.executeAcc: 2.87e-4 CPU  0.000488s TOTAL
    >>>>>    runAsyncIn.collect
    <<<<<    runAsyncIn.collect: 9.2e-5 CPU  0.000049s TOTAL
    <<<<<   evalStateT: 4.1213e-2 CPU  0.041739s TOTAL
    >>>>>   pop
    <<<<<   pop: 0.0 CPU  0.000002s TOTAL
    >>>>>   performGC
    <<<<<   performGC: 9.41e-4 CPU  0.000861s TOTAL
    <<<<<  evalCUDA: 4.3308e-2 CPU  0.042893s TOTAL
    <<<<< runAsyncIn.execute: 8.5154e-2 CPU  0.084815s TOTAL
    <<<<< run: 8.5372e-2 CPU  0.085035s TOTAL
    Array (Z) [1000001.0]
    0.085169s
    

    As we can see there are two major problems: the evaluation of fromList (Z:.1000000) [1..1000000] :: Vector Double, which takes an extra 69 ms under ghci (106 ms vs 37 ms), and the performGC call, which takes an extra 57 ms (58 ms vs 1 ms). Together these two account for the difference between execution under ghci and in a compiled version.

    I suppose that in a compiled program the RTS manages memory differently than under ghci, so allocation and GC can be faster. We can also test just this part by evaluating the code below (it does not require CUDA at all):

    import Data.Array.Accelerate.Array.Sugar
    import Data.Time.Clock                   (diffUTCTime, getCurrentTime)
    import System.Mem                        (performGC)
    
    
    main :: IO ()
    main = do
        -- force construction of the host-side array, then time a full GC
        measure $ seq (fromList (Z:.1000000) [1..1000000] :: Vector Double) $ return ()
        measure performGC
    
    measure :: IO () -> IO ()
    measure action = do
        start <- getCurrentTime
        action
        end   <- getCurrentTime
        print $ diffUTCTime end start
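    To reproduce both settings (a sketch; the file name Bench.hs is my own), mirror the commands used for Main.hs earlier:

    $ ghc -O2 Bench.hs -o bench && ./bench
    $ ghc -O2 -dynamic -c Bench.hs && ghci Bench
    ghci> main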
    

    Results:

    • evaluating the vector takes 0.121653s under ghci and 0.035162s in a compiled version
    • performGC takes 0.044876s under ghci and 0.00031s in a compiled version.

    This could be another question, but maybe someone knows: can we somehow tune the garbage collector to work faster under ghci?
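    Not something I have verified, but GHCi accepts the standard RTS options on its command line, so one thing to try is enlarging the allocation area with the -A RTS flag to take pressure off the collector:

    $ ghci +RTS -A128m -RTS Main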
