Haskell中的高效内存字符串 [英] Memory efficient strings in Haskell
问题描述
当需要存储和比较Haskell中的大量小字符串时,最佳实践是什么?
下面我试图将一个有问题的案例压缩成一个小例子:
将合格的Data.ByteString.Lazy.Char8导入为S
将合格的Data.ByteString导入为Strict
将合格的Data.Map导入为映射
将合格的Data.Vector.Unboxed导入为U
将合格的Data.Serialize导入为序列化
导入Control.Monad.State
main = putStr
。不合格。地图显示。 flip evalState(0,Map.empty)
。 mapM toInt
。 S.words
=<<
S.getContents
toInt x = do
let x'=
U.fromList。 Strict.unpack。 - 注释这一行以增加内存使用量
Serialize.encode $ x
(i,t)< - get
case Map.lookup x't of
Just j - >返回j
Nothing - >做
让i'= i +(1 :: Int)
put(i',Map.insert x'it)
return i
当我在一个包含大约400,000字英文文本的文件上运行这个时,带有严格字节串键的版本使用大约50MB的内存,与Word8向量一起使用6MB。
在没有其他答案的情况下,我会在这里出去。
当需要存储和比较Haskell中的大量小字符串时,最佳实践是什么?
如果小字符串是人类可读的(例如英文单词),则使用 Text
。如果它们只能被计算机读取,请使用 ByteString
。决定使用严格还是懒惰的变体取决于你如何构建和使用这些小字符串。
你不需要使用你自己的unboxed 但我认为你很可能正在做一些奇怪的事情,正如Hammar和上司所建议的那样。 PS对于您的特定用例,您应该考虑一个更适合您的需求的数据结构,而不是存储很多小字符串。 Trie作为丹尔建议。 The commonly recommended Haskell string types seem to be ByteString or Text. I often work with a very large number of short (English word sized) strings, and typically need to store them in a lookup table such as Data.Map. In many cases I find that in this scenario, a table of Strings can take up less memory then a table of ByteStrings. Unboxed Data.Vectors of Word8 are also (much) more compact than ByteStrings. What is the best practice when one needs to store and compare large numbers of small strings in Haskell? Below I have tried to condense a particular problematic case into a small example: When I run this on a file containing around 400.000 words of English text, the version with strict bytestring keys uses around 50MB memory, the one with Word8 vectors uses 6MB. In the absence of other answers, I'm going to go out on a limb here. What is the best practice when one needs to store and compare large numbers of small strings in Haskell? If the small strings are meant to be human readable (e.g. an English word) then use You shouldn't need to use your own unboxed But I think it highly likely that you are just doing something weird, as hammar and shang suggest. P.S. for your particular use case, instead of storing a lot of small strings, you should consider a more appropriate data structure catered to your needs, e.g. a Trie as danr suggests. 这篇关于Haskell中的高效内存字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋! Vector
s Word8
。如果遇到特定情况,例如 String
比 Text
或 ByteString $快c $ c>,然后将详细信息放在StackOverflow上,我们将尝试找出原因。如果您执行详细分析,并且可以证明
Word8
的一个未装箱的 Vector
持续工作的效果明显优于 Text
或 ByteString
,然后在邮件列表,irc,reddit等上开始对话;标准图书馆并不是一成不变的,并且总是欢迎改进。
import qualified Data.ByteString.Lazy.Char8 as S
import qualified Data.ByteString as Strict
import qualified Data.Map as Map
import qualified Data.Vector.Unboxed as U
import qualified Data.Serialize as Serialize
import Control.Monad.State
main = putStr
. unlines . map show . flip evalState (0,Map.empty)
. mapM toInt
. S.words
=<<
S.getContents
toInt x = do
let x' =
U.fromList . Strict.unpack . -- Comment this line to increase memory usage
Serialize.encode $ x
(i,t) <- get
case Map.lookup x' t of
Just j -> return j
Nothing -> do
let i' = i + (1::Int)
put (i', Map.insert x' i t)
return i
Text
. If they are meant to be read only by the computer, use ByteString
. The decision to use strict or lazy variants of these depends on how you build and use these small strings.Vector
s of Word8
. If you experiencing a specific situation where regular String
is faster than Text
or ByteString
, then throw the details up on StackOverflow and we'll try to figure out why. If you perform detailed analysis and can prove that an unboxed Vector
of Word8
consistently works significantly better than Text
or ByteString
, then start conversations on mailing lists, irc, reddit, etc; the standard libraries are not set in stone, and improvements are always welcome.