Can GHC unpack enumerations on strict data fields?


Question


In a comment on the previous question, I claimed:

I have another benchmark to indicate ghc-7.4.1+llvm does unpacking of enumerations on strict data fields.

In fact, after some experimentation, I believe that, in at least some simplistic cases, using an enumeration is at least as fast as, and actually might be more memory-efficient (hence faster in a more realistic application) than, using a newtype of Word8 (or Int), even when used in a strict field of a datatype. As I said in the previous question, I have experienced a similar phenomenon in a more realistic (but still small) setting.

Can anyone point me to some relevant references as to what optimizations ghc/llvm does for enumerations? In particular, does it really unpack the internal tag of the enumeration on a strict data field? The assembly output and the result of profiling seem to suggest that that is the case, but to me there is no indication on the core level. Any insights will be greatly appreciated.

One more question: Are enumerations always at least as efficient as newtypes of the corresponding Integral, where using them makes sense? (Note that enumerations can behave like Integrals, too.) If not, what is the (hopefully realistically useful) exception? Daniel Fischer suggested in his answer that putting an enumeration on a strict field of a multiple-constructor datatype might prevent certain optimizations. However, I have failed to verify this in the two-constructor case. Maybe there is a difference when putting them in a large multiple-constructor datatype?

I'm also curious about what exactly is happening in the following benchmark. In all four cases, roughly the same number of bytes was allocated in the heap. However, for enumerations, the GC actually did less copying and the maximum residencies were smaller than for the newtypes.

(Actually my real question was: is it worthwhile to try to convert enumerations to newtypes when performance matters? But I assumed that being a bit more specific might be more helpful.)

A possible implication of this question would be this: if you are using a large number of Ints in your program which actually vary over some very small subset, then changing them to an enumeration (rather than to an unboxed type!) might be a performance gain (but be careful about strictness).
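As a minimal sketch of the conversion this paragraph describes (the `Level` type and helper names below are hypothetical, not from the benchmark): an Int field that only ever holds 0, 1 or 2 can become a three-constructor enumeration, with `fromEnum`/`toEnum` bridging to the old integer encoding.

```haskell
-- Hypothetical example: replacing a small-range Int with an enumeration.
data Level = Low | Medium | High
  deriving (Eq, Ord, Show, Enum, Bounded)

-- Conversions to and from the old Int representation:
levelToInt :: Level -> Int
levelToInt = fromEnum

levelFromInt :: Int -> Level
levelFromInt = toEnum
```

Values like `Low` are shared static closures, whereas each boxed Int is a separate heap object, which is where the memory difference discussed below comes from.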

Here is the summary of the benchmark, followed by the benchmark code and the convenience makefile for testing on your system.

benchmarking d
mean: 11.09113 ns, lb 11.06140 ns, ub 11.17545 ns, ci 0.950
std dev: 234.6722 ps, lb 72.31532 ps, ub 490.1156 ps, ci 0.950

benchmarking e
mean: 11.54242 ns, lb 11.51789 ns, ub 11.59720 ns, ci 0.950
std dev: 178.8556 ps, lb 73.05290 ps, ub 309.0252 ps, ci 0.950

benchmarking s
mean: 11.74964 ns, lb 11.52543 ns, ub 12.50447 ns, ci 0.950
std dev: 1.803095 ns, lb 207.2720 ps, ub 4.029809 ns, ci 0.950

benchmarking t
mean: 11.89797 ns, lb 11.86266 ns, ub 11.99105 ns, ci 0.950
std dev: 269.5795 ps, lb 81.65093 ps, ub 533.8658 ps, ci 0.950

OK, so the enumeration appears to be no less efficient than the newtype.
Next, heap profiles of the function
heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x

data    D = A | B | C:
      10,892,604 bytes allocated in the heap
       6,401,260 bytes copied during GC
       1,396,092 bytes maximum residency (3 sample(s))
          55,940 bytes maximum slop
               6 MB total memory in use (0 MB lost due to fragmentation)
  Productivity  47.8% of total user, 35.4% of total elapsed

newtype E = E Word8:
      11,692,768 bytes allocated in the heap
       8,909,632 bytes copied during GC
       2,779,776 bytes maximum residency (3 sample(s))
          92,464 bytes maximum slop
               7 MB total memory in use (0 MB lost due to fragmentation)
  Productivity  36.9% of total user, 33.8% of total elapsed

data  S = S !D:
      10,892,736 bytes allocated in the heap
       6,401,260 bytes copied during GC
       1,396,092 bytes maximum residency (3 sample(s))
          55,940 bytes maximum slop
               6 MB total memory in use (0 MB lost due to fragmentation)
  Productivity  48.7% of total user, 33.3% of total elapsed

data  T = T {-# UNPACK #-} !E:
      11,692,968 bytes allocated in the heap
       8,909,640 bytes copied during GC
       2,779,760 bytes maximum residency (3 sample(s))
          92,536 bytes maximum slop
               7 MB total memory in use (0 MB lost due to fragmentation)
  Productivity  36.1% of total user, 31.6% of total elapsed

A similar performance gain can be obtained in the two-constructor case.
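One hypothetical way such two-constructor variants could look (these types are illustrative only, analogous to S and T but not part of the benchmark code):

```haskell
import Data.Word (Word8)

data    D = A | B | C deriving (Eq, Show)
newtype E = E Word8   deriving (Eq, Show)

-- Two-constructor wrappers analogous to the question's S and T:
data S2 = SL !D | SR !D
  deriving (Eq, Show)
data T2 = TL {-# UNPACK #-} !E | TR {-# UNPACK #-} !E
  deriving (Eq, Show)
```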

The benchmark code (save it as EnumTest.hs):


{-# LANGUAGE CPP, MagicHash, BangPatterns, GeneralizedNewtypeDeriving #-}
module Main(main,d,e,s,t,D(..),E(..),S(..),T(..))
where       
import GHC.Base  
import Data.List
import Data.Word
import Control.DeepSeq
import Criterion.Main

data    D = A | B | C  deriving(Eq,Ord,Show,Enum,Bounded)
newtype E = E Word8    deriving(Eq,Ord,Show,Enum)

data    S = S                !D deriving (Eq,Ord,Show) 
data    T = T {-# UNPACK #-} !E deriving (Eq,Ord,Show)

-- I assume the following definitions are all correct --- otherwise
-- the whole benchmark may be useless
instance NFData D where
  rnf !x         = ()
instance NFData E where
  rnf (E !x)     = ()
instance NFData S where
  rnf (S !x)     = ()
instance NFData T where
  rnf (T (E !x)) = ()  

instance Enum S where
  toEnum         = S . toEnum
  fromEnum (S x) = fromEnum x 
instance Enum T where
  toEnum         = T . toEnum
  fromEnum (T x) = fromEnum x 

instance Bounded E where
  minBound = E 0
  maxBound = E 2
instance Bounded S where
  minBound = S minBound
  maxBound = S maxBound
instance Bounded T where
  minBound = T minBound
  maxBound = T maxBound

succ' :: (Eq a,Enum a,Bounded a) => a -> a
succ' x | x == maxBound = minBound
        | otherwise     = succ x

-- Those numbers below are for easy browsing of the assembly code
d :: D -> Int#
d x = case x of
  A -> 1234#
  B -> 5678#
  C -> 9412#

e :: E -> Int#
e x = case x of     
  E 0 -> 1357#
  E 1 -> 2468#
  E _ -> 9914#

s :: S -> Int#
s x = case x of     
  S A -> 9876#
  S B -> 5432#
  S C -> 1097#

t :: T -> Int#
t x = case x of     
  T (E 0) -> 9630#
  T (E 1) -> 8529#
  T (E _) -> 7418#


benchmark :: IO ()
benchmark = defaultMain [ bench "d" $ whnf d' A
                        , bench "e" $ whnf e' (E 0)
                        , bench "s" $ whnf s' (S A) 
                        , bench "t" $ whnf t' (T (E 0))
                        ]
  where
    d' x = I# (d x)
    e' x = I# (e x)
    s' x = I# (s x)
    t' x = I# (t x)    

heapTest :: (NFData a,Show a,Eq a,Enum a,Bounded a) => a -> IO ()
heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x

main :: IO ()
main =    
#if   defined TEST_D
     heapTest (A :: D)
#elif defined TEST_E                
     heapTest (E 0 :: E)
#elif defined TEST_S
     heapTest (S A :: S)
#elif defined TEST_T                
     heapTest (T (E 0) :: T)
#else
     benchmark
#endif

-- A minor rant: 
-- For reliable statistics, I hope Criterion will run the code in *random order*,
-- at least for comparing functions with the same type. Elapsed times on my system are just too
-- noisy to conclude anything.

The makefile used for the benchmark:


GHC=/usr/bin/ghc
# If you don't like the ATT syntax in the output assembly, use this: -fllvm -optlc --x86-asm-syntax=intel
GHC_DEBUG_FLAGS= -keep-s-file -keep-llvm-file  # -optlc --x86-asm-syntax=intel
GHCFLAGS=-O2 -funbox-strict-fields -rtsopts -fllvm -fwarn-missing-signatures
GHC_MAKE=$(GHC) --make $(GHCFLAGS)
GHC_PROF_MAKE=$(GHC)  -prof  -auto-all -caf-all --make $(GHCFLAGS)

all : benchmark enumtest_all

enumtest_d : EnumTest.hs
    $(GHC_MAKE) -o $@ $^ -DTEST_D

enumtest_e : EnumTest.hs
    $(GHC_MAKE) -o $@ $^ -DTEST_E

enumtest_s : EnumTest.hs
    $(GHC_MAKE) -o $@ $^ -DTEST_S

enumtest_t : EnumTest.hs
    $(GHC_MAKE) -o $@ $^ -DTEST_T

enumtest_all : enumtest_d enumtest_e enumtest_s enumtest_t
    for x in $^; do ./$$x +RTS -sstderr ;done

benchmark : EnumTest
    time ./$^

% : %.hs
    $(GHC_MAKE) -o $@ $^

%.core : %.hs
    $(GHC)  -S  $(GHCFLAGS)   $(GHC_DEBUG_FLAGS) -ddump-simpl -dsuppress-all -dsuppress-coercions -ddump-stranal $^ > $@

clean :
    rm *.hi *.o *.core *.s enumtest_? ; true

Thank you very much!

Solution

First

Daniel Fischer suggested in his answer that putting an enumeration on a strict field of a multiple-constructor datatype might prevent certain optimizations.

you misunderstood that. If you have a constructor C of a type (it doesn't matter whether that type has multiple constructors or just one) with a strict field of type T, C ... !T ..., then the strict field can be unpacked if T is a newtype wrapper around an unpackable type, but not if T is an enumeration. It should in principle be possible to unpack the constructor tag of the value of type T too, but GHC just doesn't do that (there's probably a reason for it; it may be more complicated than I see). Nevertheless, for small enough enumeration types, the pointer tagging mentioned by Mikhail Gushenkov should have more or less the same effect (not quite, perhaps).
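The point can be illustrated with a small sketch (the type names here are made up): a strict newtype field around Word8 can be stored inline in its enclosing constructor, while a strict enumeration field remains a pointer to one of the shared constructor closures.

```haskell
import Data.Word (Word8)

-- Hypothetical types illustrating the distinction above:
newtype N = N Word8              -- newtype around an unpackable type
data Color = Red | Green | Blue  -- an enumeration
  deriving (Eq, Show)

-- The Word8# inside N can be stored inline in P's constructor
-- (with -O / -funbox-strict-fields):
data P = P {-# UNPACK #-} !N

-- GHC does not unpack the enumeration's tag; Q's field remains a
-- pointer to one of the shared Red/Green/Blue closures:
data Q = Q !Color
```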

But for enumeration types with more than 3 or 7 (for 64-bit words) constructors, where following the pointer is necessary in some cases, the difference should manifest.

Second, the short answer to

Actually my real question was, is it worthwhile to try to convert enumerations to newtypes when performance matters?

Sometimes.

Whether a conversion would actually improve performance, and by how much, depends on what you are doing with the values. It could also make your programme slower. And it may use more memory (see below).

There is no general rule, each case needs to be evaluated. There are patterns where newtype-wrapped Ints are faster, there are patterns where they are slower. A typical programme would contain instances of both, and one must find out which dominate.


Now to the benchmark.

I took the liberty of changing the arguments in the benchmark, using C instead of A and E 2 instead of E 0. The results were that the newtype was slightly faster for these:

warming up
estimating clock resolution...
mean is 1.549612 us (640001 iterations)
found 4506 outliers among 639999 samples (0.7%)
  3639 (0.6%) high severe
estimating cost of a clock call...
mean is 39.24624 ns (12 iterations)
found 2 outliers among 12 samples (16.7%)
  1 (8.3%) low mild
  1 (8.3%) high severe

benchmarking d
mean: 12.12989 ns, lb 12.01136 ns, ub 12.32002 ns, ci 0.950
std dev: 755.9999 ps, lb 529.5348 ps, ub 1.034185 ns, ci 0.950
found 17 outliers among 100 samples (17.0%)
  17 (17.0%) high severe
variance introduced by outliers: 59.503%
variance is severely inflated by outliers

benchmarking e
mean: 10.82692 ns, lb 10.73286 ns, ub 10.98045 ns, ci 0.950
std dev: 604.1786 ps, lb 416.5018 ps, ub 871.0923 ps, ci 0.950
found 10 outliers among 100 samples (10.0%)
  4 (4.0%) high mild
  6 (6.0%) high severe
variance introduced by outliers: 53.482%
variance is severely inflated by outliers

benchmarking s
mean: 13.18192 ns, lb 13.11898 ns, ub 13.25911 ns, ci 0.950
std dev: 354.1332 ps, lb 300.2860 ps, ub 406.2424 ps, ci 0.950
found 13 outliers among 100 samples (13.0%)
  13 (13.0%) high mild
variance introduced by outliers: 20.952%
variance is moderately inflated by outliers

benchmarking t
mean: 11.16461 ns, lb 11.02716 ns, ub 11.37018 ns, ci 0.950
std dev: 853.2152 ps, lb 602.5197 ps, ub 1.086899 ns, ci 0.950
found 14 outliers among 100 samples (14.0%)
  3 (3.0%) high mild
  11 (11.0%) high severe
variance introduced by outliers: 68.689%
variance is severely inflated by outliers

So the benchmark shows no substantial difference in either case, and the overall outcome depends on the supplied arguments. With B and E 1, respectively, the difference was smaller, but in my run the newtype won there too.

Note, however, that the estimated cost of a clock call is about four times the mean of any of these, and the estimated clock resolution more than 100 times it. I am not convinced that these results are reliable. Considering the variance I observe on my system for short benchmarks, I do not trust benchmarks of anything that runs for less than 10 microseconds. I prefer longer-running tests because the results are more stable and fewer outliers are produced.

Regarding

heapTest x = print $ head $ force $ reverse $ take 100000 $ iterate (force . succ') x

unfortunately, iterate (force . succ') doesn't force the elements of the list as it is constructed, so you get a list of thunks (of increasing depth), reverse the initial segment of that, and then force the list elements.

So the overwhelming part of work and allocation done in that is the construction of thunks and the list, and then the evaluation of the thunks. You get a more meaningful result if you prevent the building of large thunks by forcing the elements as they are generated,

iterate' :: (a -> a) -> a -> [a]
iterate' f !a = a : iterate' f (f a)

(a bang pattern - WHNF - is enough to completely evaluate values of the types in question).
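Put together, a self-contained version of this fix might look as follows (succ' and the enumeration are reproduced from the question so the snippet stands alone): the bang pattern forces each element before it is consed onto the list, so no chain of nested thunks builds up.

```haskell
{-# LANGUAGE BangPatterns #-}

-- A small enumeration like the question's D:
data D = A | B | C deriving (Eq, Show, Enum, Bounded)

-- Strict iterate: forces each element to WHNF as the list is built.
iterate' :: (a -> a) -> a -> [a]
iterate' f !a = a : iterate' f (f a)

-- Wrapping successor, as in the question:
succ' :: (Eq a, Enum a, Bounded a) => a -> a
succ' x | x == maxBound = minBound
        | otherwise     = succ x
```

With this, `take 7 (iterate' succ' A)` cycles through the constructors without accumulating thunks.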

Still, there is an observable and consistent difference in the allocation figures between the enumeration and the newtype variants, and it remains with iterate' too. And if, instead of reversing an initial segment and taking the head of that, one simply takes list !! index, it becomes even more impressive, because the other figures (copied bytes, maximum residency) are then small.

The reason is that list elements can't be unboxed, so the list cells contain pointers to the element values, and the enumeration values are shared (there exists only one A in the entire programme), so all these pointers point to the same object, but integers aren't shared, so each list cell points to a different object.

The difference in allocation is on your system almost exactly 800,000 bytes, on mine 1,600,000.

That is what 200,000 words occupy, which is what allocating 100,000 boxed Word8 (or Word, Int, ...) values requires: per value, one word for the constructor and one for the Word# or Int#.
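The arithmetic can be checked with a quick back-of-envelope calculation (assuming the usual heap layout just described, with word sizes of 8 and 4 bytes for 64-bit and 32-bit systems respectively):

```haskell
-- Back-of-envelope check of the allocation difference quoted above:
-- 100,000 boxed Word8/Word/Int values at two machine words each
-- (one word for the constructor header, one for the Word#/Int# payload).
values, wordsPerValue :: Int
values        = 100000
wordsPerValue = 2

-- Extra bytes allocated, given the machine word size in bytes:
extraBytes :: Int -> Int
extraBytes bytesPerWord = values * wordsPerValue * bytesPerWord
-- extraBytes 8 == 1,600,000 (64-bit), extraBytes 4 == 800,000 (32-bit),
-- matching the two figures above.
```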
