Haskell中的高效内存字符串 [英] Memory efficient strings in Haskell

查看:73
本文介绍了Haskell中的高效内存字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

通常推荐的Haskell字符串类型似乎是ByteString或Text。我经常使用大量短的(英文字大小)字符串,通常需要将它们存储在诸如Data.Map之类的查找表中。在很多情况下,我发现在这种情况下,一个字符串表可以占用较少的内存,然后是一个ByteStrings表。无箱数据.Word8的矢量图也比ByteStrings更紧凑。



当需要存储和比较Haskell中的大量小字符串时,最佳实践是什么?



下面我试图将一个有问题的案例压缩成一个小例子:

 将合格的Data.ByteString.Lazy.Char8导入为S 
将合格的Data.ByteString导入为Strict
将合格的Data.Map导入为映射
将合格的Data.Vector.Unboxed导入为U
将合格的Data.Serialize导入为序列化
导入Control.Monad.State

main = putStr
。不合格。地图显示。 flip evalState(0,Map.empty)
。 mapM toInt
。 S.words
=<<
S.getContents


toInt x = do
let x'=
U.fromList。 Strict.unpack。 - 注释这一行以增加内存使用量
Serialize.encode $ x
(i,t)< - get
case Map.lookup x't of
Just j - >返回j
Nothing - >做
让i'= i +(1 :: Int)
put(i',Map.insert x'it)
return i

当我在一个包含大约400,000字英文文本的文件上运行这个时,带有严格字节串键的版本使用大约50MB的内存,与Word8向量一起使用6MB。

解决方案

在没有其他答案的情况下,我会在这里出去。


当需要存储和比较Haskell中的大量小字符串时,最佳实践是什么?


如果小字符串是人类可读的(例如英文单词),则使用 Text 。如果它们只能被计算机读取,请使用 ByteString 。决定使用严格还是懒惰的变体取决于你如何构建和使用这些小字符串。



你不需要使用你自己的unboxed Vector s Word8 。如果遇到特定情况,例如 String Text ByteString ,然后将详细信息放在StackOverflow上,我们将尝试找出原因。如果您执行详细分析,并且可以证明 Word8 的一个未装箱的 Vector 持续工作的效果明显优于 Text ByteString ,然后在邮件列表,irc,reddit等上开始对话;标准图书馆并不是一成不变的,并且总是欢迎改进。

但我认为你很可能正在做一些奇怪的事情,正如Hammar和上司所建议的那样。

PS对于您的特定用例,您应该考虑一个更适合您的需求的数据结构,而不是存储很多小字符串。 Trie作为丹尔建议。

The commonly recommended Haskell string types seem to be ByteString or Text. I often work with a very large number of short (English word sized) strings, and typically need to store them in a lookup table such as Data.Map. In many cases I find that in this scenario, a table of Strings can take up less memory then a table of ByteStrings. Unboxed Data.Vectors of Word8 are also (much) more compact than ByteStrings.

What is the best practice when one needs to store and compare large numbers of small strings in Haskell?

Below I have tried to condense a particular problematic case into a small example:

import qualified Data.ByteString.Lazy.Char8 as S
import qualified Data.ByteString as Strict
import qualified Data.Map as Map
import qualified Data.Vector.Unboxed as U
import qualified Data.Serialize as Serialize
import Control.Monad.State

main =   putStr 
  . unlines . map show . flip evalState (0,Map.empty) 
  . mapM toInt 
  . S.words
  =<<
  S.getContents


toInt x = do  
  let x' =   
          U.fromList . Strict.unpack .  -- Comment this line to increase memory usage
           Serialize.encode $ x  
  (i,t) <- get
  case Map.lookup x' t of
    Just j -> return j
    Nothing -> do 
      let i' = i + (1::Int)
      put (i', Map.insert x' i t)
      return i

When I run this on a file containing around 400.000 words of English text, the version with strict bytestring keys uses around 50MB memory, the one with Word8 vectors uses 6MB.

解决方案

In the absence of other answers, I'm going to go out on a limb here.

What is the best practice when one needs to store and compare large numbers of small strings in Haskell?

If the small strings are meant to be human readable (e.g. an English word) then use Text. If they are meant to be read only by the computer, use ByteString. The decision to use strict or lazy variants of these depends on how you build and use these small strings.

You shouldn't need to use your own unboxed Vectors of Word8. If you experiencing a specific situation where regular String is faster than Text or ByteString, then throw the details up on StackOverflow and we'll try to figure out why. If you perform detailed analysis and can prove that an unboxed Vector of Word8 consistently works significantly better than Text or ByteString, then start conversations on mailing lists, irc, reddit, etc; the standard libraries are not set in stone, and improvements are always welcome.

But I think it highly likely that you are just doing something weird, as hammar and shang suggest.

P.S. for your particular use case, instead of storing a lot of small strings, you should consider a more appropriate data structure catered to your needs, e.g. a Trie as danr suggests.

这篇关于Haskell中的高效内存字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆