Haskell requires more memory than Python when I read map from file. Why?


Question


I have this simple code in Python:

# Note: 'input' shadows the Python built-in; kept as in the original post.
input = open("baseforms.txt", "r", encoding='utf8')
S = {}
for i in input:
    words = i.split()
    # Map every word on the line (the base form included) to the line's first word.
    S.update({j: words[0] for j in words})
print(S.get("sometext", "not found"))
print(len(S))

It needs about 300 MB to run; "baseforms.txt" is 123 MB. I wrote the same code in Haskell:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Text.Lazy.Encoding (decodeUtf8)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I  -- unused in this program
import Control.Monad (liftM)             -- unused in this program

main = do
    text <- B.readFile "baseforms.txt"
    -- Decode each line to lazy Text, then pair every form with its base form.
    let m = (M.fromList . concatMap (parseLine . decodeUtf8)) (B.lines text)
    print (M.lookup "sometext" m)
    print (M.size m)
    where
        parseLine line = let base:forms = T.words line in [(f, base) | f <- forms]

It requires 544 MB and it's slower than the Python version. Why? Is it possible to optimise the Haskell version?

Solution

It's a bit late, but I studied this a little and think Dietrich Epp's account is right, but can be simplified a little. Notice that there doesn't seem to be any real python programming going on in the python file: it is orchestrating a very simple sequence of calls to C string operations and then to a C hash table implementation. (This is often a problem with really simple python v. Haskell benchmarks.) The Haskell, by contrast, is building an immense persistent Map, which is a fancy tree. So the main points of opposition here are C vs Haskell, and hashtable-with-destructive-update vs persistent map. Since there is little overlap in the input file, the tree you are constructing includes all the information in the input string, some of it repeated, and then rearranged with a pile of Haskell constructors. This is I think the source of the alarm you are experiencing, but it can be explained.
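
To make the "hashtable-with-destructive-update" side concrete within Haskell itself, here is a rough sketch of a mutable variant. This is my addition, not part of the original answer: it assumes the hashtables and hashable packages are available, and H.BasicHashTable is just one reasonable table choice.

import qualified Data.ByteString.Char8 as B
import qualified Data.HashTable.IO as H  -- 'hashtables' package (assumed installed)
import Data.Foldable (for_)

type Table = H.BasicHashTable B.ByteString B.ByteString

main :: IO ()
main = do
    ht <- H.new :: IO Table
    ls <- fmap (map B.words . B.lines) (B.readFile "baseforms.txt")
    -- Destructive inserts: each (form, base) pair overwrites a slot in place
    -- instead of allocating fresh nodes for a persistent tree.
    for_ ls $ \ws -> case ws of
        (base:forms) -> for_ forms $ \f -> H.insert ht f base
        []           -> return ()
    print =<< H.lookup ht (B.pack "sometext")
    -- There is no O(1) size field, so count entries by folding over the table.
    n <- H.foldM (\acc _ -> return (acc + 1 :: Int)) 0 ht
    print n

The original answer's comparison, below, keeps the persistent Map on both sides and only varies the string type.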

Compare these two files, one using ByteString:

import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
main = do m <- fmap proc (B.readFile "baseforms.txt")
          print (M.lookup (B.pack "sometext") m)
          print (M.size m)
-- For each line "base form1 form2 ...", emit one (form, base) pair per form.
-- (Like the original, this assumes no blank lines in the file.)
proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
       . map B.words . B.lines

and the other a Text-ified equivalent:

import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
import Data.Text.Encoding (decodeUtf8)
import qualified Data.Text as T

main = do
    m <- fmap proc (B.readFile "baseforms.txt")
    print (M.lookup (T.pack "sometext") m)
    print (M.size m)

-- Same pipeline, but decodes the bytes to Text first, so splitting and
-- key comparisons all happen on Text values.
proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
       . map T.words . T.lines . decodeUtf8
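
As a quick sanity check of what proc builds in either version, here is a GHCi session on a made-up line (the sample line is hypothetical, not from the poster's data):

ghci> proc (B.pack "go goes went gone")
fromList [("goes","go"),("gone","go"),("went","go")]

Each word after the first becomes a key mapped back to the line's first word (the python version additionally maps the first word to itself).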

On my machine, the python/C takes just under 6 seconds, the bytestring file takes 8 seconds, and the text file just over 10.

The bytestring implementation seems to use a bit more memory than the python; the text implementation uses distinctly more. The text implementation takes more time because, of course, it adds a conversion to text and then uses text operations to break the string and text comparisons to build the map.

Here is a go at analyzing the memory phenomena in the text case. First we have the bytestring in memory (130m). Once the text is constructed (~250m, to judge unscientifically from what's going on in top), the bytestring is garbage collected while we construct the tree. In the end the text tree (~380m it looks like) uses more memory than the bytestring tree (~260m) because the text fragments in the tree are bigger. The program as a whole uses more because the text held in memory during the tree construction is itself bigger. To put it crudely: each bit of white-space is being turned into a tree constructor and two text constructors, together with the text version of the first word of the line and the text representation of the next word. The weight of the constructors seems in either case to be about 130m, so at the last moment of the construction of the tree we are using something like 130m + 130m + 130m = 390m in the bytestring case, and 250m + 130m + 250m = 630m in the text case.
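
One way to check numbers like these without watching top is GHC's own runtime statistics. The sketch below is my addition, not from the original answer; it assumes a GHC whose base library ships GHC.Stats with getRTSStats (GHC 8.2 or later), and the program must be compiled with -rtsopts and run with +RTS -T for the statistics to be collected:

import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
import GHC.Stats (getRTSStats, getRTSStatsEnabled, RTSStats(max_live_bytes))

main :: IO ()
main = do
    m <- fmap proc (B.readFile "baseforms.txt")
    print (M.size m)  -- demands the map, so it is fully built before measuring
    ok <- getRTSStatsEnabled
    if ok
        then do
            s <- getRTSStats
            putStrLn ("max live bytes: " ++ show (max_live_bytes s))
        else putStrLn "compile with -rtsopts and run with +RTS -T to enable stats"

proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
       . map B.words . B.lines

Built with something like ghc -O2 -rtsopts Main.hs and run as ./Main +RTS -T, the reported max_live_bytes corresponds roughly to the residency figures estimated above.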
