Why is `(map digitToInt) . show` so fast?


Question


Converting a non-negative Integer to its list of digits is commonly done like this:

import Data.Char

digits :: Integer -> [Int]
digits = (map digitToInt) . show

I was trying to find a more direct way to perform the task, without involving a string conversion, but I'm unable to come up with something faster.

Things I've been trying so far:

The baseline:

digits :: Int -> [Int]
digits = (map digitToInt) . show

Got this one from another question on StackOverflow:

digits2 :: Int -> [Int]
digits2 = map (`mod` 10) . reverse . takeWhile (> 0) . iterate (`div` 10)

Trying to roll my own:

digits3 :: Int -> [Int]
digits3 = reverse . revDigits3

revDigits3 :: Int -> [Int]
revDigits3 n = case divMod n 10 of
               (0, digit) -> [digit]
               (rest, digit) -> digit:(revDigits3 rest)

This one was inspired by showInt in Numeric:

digits4 n0 = go n0 [] where
    go n cs
        | n < 10    =  n:cs
        | otherwise =  go q (r:cs)
        where
        (q,r) = n `quotRem` 10

Now the benchmark. Note: I'm forcing the evaluation using filter.

λ> :set +s
λ> length $ filter (>5) $ concat $ map (digits) [1..1000000]
2400000
(1.58 secs, 771212628 bytes)

This is the reference. Now for digits2:

λ> length $ filter (>5) $ concat $ map (digits2) [1..1000000]
2400000
(5.47 secs, 1256170448 bytes)

That's 3.46 times longer.

λ> length $ filter (>5) $ concat $ map (digits3) [1..1000000]
2400000
(7.74 secs, 1365486528 bytes)

digits3 is 4.89 times slower. Just for fun, I tried using only revDigits3, avoiding the reverse.

λ> length $ filter (>5) $ concat $ map (revDigits3) [1..1000000]
2400000
(8.28 secs, 1277538760 bytes)

Strangely, this is even slower, 5.24 times slower.

And the last one:

λ> length $ filter (>5) $ concat $ map (digits4) [1..1000000]
2400000
(16.48 secs, 1779445968 bytes)

This is 10.43 times slower.

I was under the impression that using only arithmetic and cons would outperform anything involving a string conversion. Apparently, there's something I can't grasp.

So what's the trick? Why is digits so fast?

I'm using GHC 6.12.3.

Solution

Seeing as I can't add comments yet, I'll do a little bit more work and just analyze all of them. I'm putting the analysis at the top; however, the relevant data is below. (Note: all of this is done in 6.12.3 as well - no GHC 7 magic yet.)


Analysis:

Version 1: show is pretty good for Ints, especially ones as short as we have here. Making strings actually tends to be decent in GHC; however, reading strings back in and writing large strings to files (or stdout, although you wouldn't want to do that) are where your code can absolutely crawl. I would suspect that a lot of the details behind why this is so fast are due to clever optimizations within show for Ints.

Version 2: This one was the slowest of the bunch when compiled. Some problems: reverse is strict in its argument. What this means is that you don't benefit from being able to perform computations on the first part of the list while you're computing the next elements; you have to compute them all, flip them, and then do your computations (namely (`mod` 10)) on the elements of the list. While this may seem small, it can lead to greater memory usage (note the 5GB of heap memory allocated here as well) and slower computations. (Long story short: don't use reverse.)

Version 3: Remember how I just said don't use reverse? Turns out, if you take it out, this one drops to 1.79s total execution time - barely slower than the baseline. The only problem here is that as you go deeper into the number, you're building up the spine of the list in the wrong direction (essentially, you're consing "into" the list with recursion, as opposed to consing "onto" the list).

Version 4: This is a very clever implementation. You benefit from several nice things: for one, quotRem computes the quotient and remainder together, and (unlike divMod) needs no sign-correction step, so each digit costs essentially one machine division. Furthermore, you cons onto the list as discussed above, so you don't have to resolve any list thunks as you go - the list is already entirely constructed when you come back around to consume it. As you can see, the performance benefits from this.

This code was probably the slowest in GHCi because a lot of the optimizations performed with the -O3 flag in GHC deal with making lists faster, whereas GHCi wouldn't do any of that.

Lessons: cons the right way onto a list, watch for intermediate strictness that can slow down computations, and do some legwork in looking at the fine-grained statistics of your code's performance. Also compile with the -O3 flags: whenever you don't, all those people who put a lot of hours into making GHC super-fast get big ol' puppy eyes at you.


Data:

I just took all four functions, stuck them into one .hs file, and then changed as necessary to reflect the function in use. Also, I bumped your limit up to 5e6, because in some cases compiled code would run in less than half a second on 1e6, and this can start to cause granularity problems with the measurements we're making.

Compiler options: use ghc --make -O3 [filename].hs to have GHC do some optimization. We'll dump statistics to standard error using digits +RTS -sstderr.

Dumping to -sstderr gives us output that looks like this, in the case of digits1:

digits1 +RTS -sstderr
12000000
   2,885,827,628 bytes allocated in the heap
         446,080 bytes copied during GC
           3,224 bytes maximum residency (1 sample(s))
          12,100 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  5504 collections,     0 parallel,  0.06s,  0.03s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.61s  (  1.66s elapsed)
  GC    time    0.06s  (  0.03s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.67s  (  1.69s elapsed)

  %GC time       3.7%  (1.5% elapsed)

  Alloc rate    1,795,998,050 bytes per MUT second

  Productivity  96.3% of total user, 95.2% of total elapsed

There are three key statistics here:

  1. Total memory in use: only 1MB means this version is very space-efficient.
  2. Total time: 1.61s means nothing now, but we'll see how it looks against the other implementations.
  3. Productivity: This is just 100% minus garbage collecting; since we're at 96.3% this means that we're not creating a lot of objects that we leave lying around in memory.

Alright, let's move on to version 2.

digits2 +RTS -sstderr
12000000
   5,512,869,824 bytes allocated in the heap
       1,312,416 bytes copied during GC
           3,336 bytes maximum residency (1 sample(s))
          13,048 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 10515 collections,     0 parallel,  0.06s,  0.04s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    3.20s  (  3.25s elapsed)
  GC    time    0.06s  (  0.04s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    3.26s  (  3.29s elapsed)

  %GC time       1.9%  (1.2% elapsed)

  Alloc rate    1,723,838,984 bytes per MUT second

  Productivity  98.1% of total user, 97.1% of total elapsed

Alright, so we're seeing an interesting pattern.

  1. Same amount of memory used. This means that this is a pretty good implementation, although it could mean that we need to test on higher sample inputs to see if we can find a difference.
  2. It takes twice as long. We'll come back to some speculation as to why this is later.
  3. It's actually slightly more productive, but given that GC is not a huge portion of either program, this doesn't gain us anything significant.

Version 3:

digits3 +RTS -sstderr
12000000
   3,231,154,752 bytes allocated in the heap
         832,724 bytes copied during GC
           3,292 bytes maximum residency (1 sample(s))
          12,100 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  6163 collections,     0 parallel,  0.02s,  0.02s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.09s  (  2.08s elapsed)
  GC    time    0.02s  (  0.02s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    2.11s  (  2.10s elapsed)

  %GC time       0.7%  (1.0% elapsed)

  Alloc rate    1,545,701,615 bytes per MUT second

  Productivity  99.3% of total user, 99.3% of total elapsed

Alright, so we're seeing some strange patterns.

  1. We're still at 1MB total memory in use. So we haven't hit anything memory-inefficient, which is good.
  2. We're not quite at digits1, but we've got digits2 beat pretty easily.
  3. Very little GC. (Keep in mind that anything over 95% productivity is very good, so we're not really dealing with anything too significant here.)

And finally, version 4:

digits4 +RTS -sstderr
12000000
   1,347,856,636 bytes allocated in the heap
         270,692 bytes copied during GC
           3,180 bytes maximum residency (1 sample(s))
          12,100 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  2570 collections,     0 parallel,  0.00s,  0.01s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.09s  (  1.08s elapsed)
  GC    time    0.00s  (  0.01s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.09s  (  1.09s elapsed)

  %GC time       0.0%  (0.8% elapsed)

  Alloc rate    1,234,293,036 bytes per MUT second

  Productivity 100.0% of total user, 100.5% of total elapsed

Wowza! Let's break it down:

  1. We're still at 1MB total. This is almost certainly a feature of these implementations, as they remain at 1MB on inputs of 5e5 and 5e7. A testament to laziness, if you will.
  2. We cut off about 32% of our original time, which is pretty impressive.
  3. I suspect that the percentages here reflect the granularity in -sstderr's monitoring rather than any computation on superluminal particles.
