What's a space leak?


Problem description


I found the haskell wiki page on space leaks, which claims to list examples of real-world leaks, which it doesn't. It doesn't really say what a space leak is; it just links to the page for memory leaks.

What's a space leak?

Solution

As noted in @Rasko's answer, a space leak refers to a situation where a program or specific computation uses more (usually much more) memory than is necessary for the computation and/or expected by the programmer.

Haskell programs tend to be particularly susceptible to space leaks, mostly because of the lazy evaluation model (sometimes complicated by the way IO interacts with this model) and the highly abstract nature of the language which can make it difficult for a programmer to determine exactly how a particular computation is likely to be performed.

It helps to consider a specific example. This Haskell program:

main = print $ sum [1..1000000000]

is an idiomatic way to sum the first billion integers. Compiled with -O2, it runs in a few seconds in constant memory (a few megabytes, basically the runtime overhead).
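
If you want to check the memory behaviour yourself, GHC's runtime statistics are the quickest way. Assuming the one-liner is saved as Sum.hs (the file name is just for this example), something like:

ghc -O2 Sum.hs
./Sum +RTS -s

prints, among other things, the maximum residency after the run, which makes it easy to tell a constant-space execution from a leaky one.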

Now, any programmer would expect that a program summing the first billion integers should run without chewing up memory, but it's actually a little surprising that this Haskell version is well behaved. After all, read literally, it constructs a list of a billion integers before summing them up, so it ought to require at least a few gigabytes (just for storage of the billion integers, not to mention the overhead of a Haskell linked list).

However, lazy evaluation ensures that the list is only generated as it's needed and -- equally importantly -- optimizations performed by the compiler ensure that as list elements are added to the accumulating sum, the program recognizes they are no longer needed and allows them to be garbage collected instead of keeping them around until the end of the computation. So, at any point during the computation, only a sliding "window" into the middle of the list needs to be kept in memory -- earlier elements have been discarded, and later elements are yet to be lazily computed. (In fact, the optimizations go further than this: no list is even constructed, but this is far from obvious to the programmer.)
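
To see why that can be cheap, here is a rough sketch of my own (not what GHC literally generates; sumTo and go are illustrative names) of the strict accumulator loop the optimized program morally boils down to:

{-# LANGUAGE BangPatterns #-}

-- Sum 1..n with an explicitly strict accumulator: each partial sum is forced
-- right away, so neither a list nor a chain of unevaluated additions builds up.
sumTo :: Integer -> Integer
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 1000000000)

Written this way, the constant-space behaviour is visible in the source itself rather than being something the optimizer has to discover.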

Soooo... Haskell programmers get used to the idea that tossing around giant (or even infinite) data structures will "just work" with computations automatically using only the memory they need.

But, a minor change to the program, like also printing the length of the list as proof of all the hard work we are doing:

main = let vals = [1..1000000000]
       in print (sum vals, length vals)

suddenly causes space usage to explode to dozens of gigabytes (or in the case of my laptop, to about 13Gigs before it starts swapping hopelessly and I kill it).

This is a space leak. Calculating the sum and length of this list are obviously things that can be done in constant space using a "sliding window" view into the list, but the above program uses much more memory than needed. The reason, it turns out, is that once the list has been given a name vals that's used in two places, the compiler no longer allows the "used" elements to be immediately discarded. If the sum vals is evaluated first, the list is lazily generated and summed, but the entire, giant list is then kept around until length vals can be evaluated.
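
One way to get constant space back here (just a sketch of the single-pass idea, not the only fix; sumAndLength is an illustrative name) is to compute both results in a single strict traversal, so no element has to be kept alive for a second pass:

{-# LANGUAGE BangPatterns #-}

import Data.List (foldl')

-- Accumulate the sum and the length together. The bang patterns force both
-- counters at every step, so each element can be dropped as soon as it is consumed.
sumAndLength :: [Integer] -> (Integer, Int)
sumAndLength = foldl' step (0, 0)
  where step (!s, !n) x = (s + x, n + 1)

main :: IO ()
main = print (sumAndLength [1..1000000000])

The same trick (one traversal, strict accumulators) is what the word-and-character example below comes down to.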

As a more practical example, you might write a simple program to count words and characters in a file:

main = do txt <- getContents
          print (length txt, length (words txt))

This works fine on small test files up to a couple of megabytes, but it's noticeably sluggish on a 10meg file, and if you try to run it on a 100meg file, it'll slowly but surely start gobbling up all available memory. Again, the problem is that -- even though the file contents are read lazily into txt -- because txt is used twice, the entire contents are read into memory as a Haskell String type (a memory-inefficient representation of large blocks of text) when, say, length txt is evaluated, and none of that memory can be freed until length (words txt) has also been computed.

Note that:

main = do txt <- getContents
          print $ length txt

and:

main = do txt <- getContents
          print $ length (words txt)

both run quickly in constant space even on big files.

As a side note, fixing the above space leak normally involves rewriting the computation so the characters and words are counted with one pass through the contents, so the compiler can determine that the contents of the file that have already been processed do not need to be kept around in memory until the end of the computation. One possible solution is:

{-# LANGUAGE BangPatterns #-}

import Data.List
import Data.Char

charsWords :: String -> (Int, Int)
charsWords str = let (_, chrs, wrds) = foldl' step (False, 0, 0) str
                 in (chrs, wrds)
  where step (inWord, cs, ws) c =
          let !cs' = succ cs
              !ws' = if not inWord && inWord' then succ ws else ws
              !inWord' = not (isSpace c)
          in (inWord', cs', ws')

main = do txt <- getContents
          print $ charsWords txt

The complexity of this solution (use of bang (!) patterns and an explicit fold instead of length and words) illustrates how tough space leaks can be, especially for new Haskell programmers. And it's not at all obvious that using foldl' instead of foldl makes no difference (but using foldr or foldr' would be a disaster!), that the bangs before cs' and ws' are critical to avoid a space leak, but that the bang before inWord' isn't (though it slightly improves performance), etc.
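
As a further aside (an alternative formulation of my own, not part of the answer above), some people prefer to put the strictness in the accumulator's type rather than in bang patterns, using a data type with strict fields; Acc is an illustrative name:

import Data.Char (isSpace)
import Data.List (foldl')

-- Strict fields: whenever an Acc is forced (which foldl' does at every step),
-- its counters are forced too, so no thunks pile up inside the accumulator.
data Acc = Acc !Bool !Int !Int

charsWords :: String -> (Int, Int)
charsWords str = let Acc _ chrs wrds = foldl' step (Acc False 0 0) str
                 in (chrs, wrds)
  where step (Acc inWord cs ws) c =
          let inWord' = not (isSpace c)
              ws'     = if not inWord && inWord' then ws + 1 else ws
          in Acc inWord' (cs + 1) ws'

main :: IO ()
main = do txt <- getContents
          print (charsWords txt)

The behaviour should be the same as the bang-pattern version above; which style reads better is largely a matter of taste.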
