字符的频率 [英] Frequency of characters
问题描述
我正在使用 Haskell 查找文件中字符的频率。我希望能够处理〜500MB大小的文件。
我到现在为止所尝试的内容
-
它可以完成这项工作,但速度有点慢,因为它将文件解析为256次
calculateFrequency :: L.ByteString - > [(Word8,Int64)]
calculateFrequency f = foldl(\acc x - >(x,L.count xf):acc)[] [255,254 .. 0]
-
我也尝试过使用Data.Map,但程序内存不足(在ghc解释器中)。
导入限定的Data.ByteString.Lazy为L
将限定的Data.Map导入为M
calculateFrequency ':: L.ByteString - > [(Word8,Int64)]
calculateFrequency'xs = M.toList $ L.foldl'(\ m word - > M.insertWith(+)word 1 m)(M.empty)xs
实现使用可变的,未装箱的向量而不是更高层次的构造。它还使用 conduit
来读取文件以避免惰性I / O。
导入Control.Monad.IO.Class
将合格的Data.ByteString导入为S
导入Data.Conduit
导入Data.Conduit.Binary作为CB
导入合格的Data.Conduit .List as CL
将合格Data.Vector.Unboxed.Mutable作为VM
导入Data.Word(Word8)
类型Freq = VM.IOVector Int
newFreq :: MonadIO m => m Freq
newFreq = liftIO $ VM.replicate 256 0
printFreq :: MonadIO m =>频率 - > m()
printFreq freq =
liftIO $ mapM_ go [0..255]
where
go i = do
x< - VM.read freq i
putStrLn $ show i ++:++ show x
addFreqWord8 :: MonadIO m =>频率 - > Word8 - > m()
addFreqWord8 fw = liftIO $ do
let index = fromIntegral w
oldCount< - VM.read f index
VM.write f index(oldCount + 1)
addFreqBS :: MonadIO m =>频率 - > S.ByteString - > m()
addFreqBS f bs =
loop(S.length bs - 1)
where
loop(-1)= return()
loop i = do
addFreqWord8 f(S.index bs i)
loop(i - 1)
- |主要的入口点。
main :: IO()
main = do
freq< - newFreq
runResourceT
$ sourceFilerandom
$$ CL.mapM_( addFreqBS freq)
printFreq freq
我在500MB的随机数据上运行这个函数, josejuan的基于UArray的答案:
- 基于导管的/可变的向量:1.006s
- UArray:17.962
我认为应该有可能保留josejuan的高级方法的优雅,同时保持可变向量的速度实现,但我还没有机会尝试实现类似的东西呢。另外,请注意,使用一些通用帮助器函数(如Data.ByteString.mapM或Data.Conduit.Binary.mapM),实现可能会显着简化,而不会影响性能。
编辑:我将其中一个缺失的函数添加到 conduit
并清理稍微提一下代码;它现在看起来像下面这样:
import Control.Monad.Trans.Class(lift)
import Data.ByteString (ByteString)
import Data.Conduit(Consumer,($$))
将合格的Data.Conduit.Binary导入为CB
将合格的Data.Vector.Unboxed导入为V
导入合格的Data.Vector.Unboxed.Mutable为VM
import System.IO(stdin)
freqSink :: Consumer ByteString IO(V.Vector Int)
freqSink = do
freq< - lift $ VM.replicate 256 0
CB.mapM_ $ \w - >做
let index = fromIntegral w
oldCount< - VM.read freq index
VM.write freq index(oldCount + 1)
lift $ V.freeze freq
main :: IO()
main =(CB.sourceHandle stdin $$ freqSink)>> = print
功能唯一的区别在于频率是如何打印的。
I am trying to find frequency of characters in file using Haskell. I want to be able to handle files ~500MB size.
What I've tried till now
It does the job but is a bit slow as it parses the file 256 times
calculateFrequency :: L.ByteString -> [(Word8, Int64)] calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0]
I have also tried using Data.Map but the program runs out of memory (in ghc interpreter).
import qualified Data.ByteString.Lazy as L import qualified Data.Map as M calculateFrequency' :: L.ByteString -> [(Word8, Int64)] calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs
Here's an implementation using mutable, unboxed vectors instead of higher level constructs. It also uses conduit
for reading the file to avoid lazy I/O.
import Control.Monad.IO.Class
import qualified Data.ByteString as S
import Data.Conduit
import Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import qualified Data.Vector.Unboxed.Mutable as VM
import Data.Word (Word8)
type Freq = VM.IOVector Int
newFreq :: MonadIO m => m Freq
newFreq = liftIO $ VM.replicate 256 0
printFreq :: MonadIO m => Freq -> m ()
printFreq freq =
liftIO $ mapM_ go [0..255]
where
go i = do
x <- VM.read freq i
putStrLn $ show i ++ ": " ++ show x
addFreqWord8 :: MonadIO m => Freq -> Word8 -> m ()
addFreqWord8 f w = liftIO $ do
let index = fromIntegral w
oldCount <- VM.read f index
VM.write f index (oldCount + 1)
addFreqBS :: MonadIO m => Freq -> S.ByteString -> m ()
addFreqBS f bs =
loop (S.length bs - 1)
where
loop (-1) = return ()
loop i = do
addFreqWord8 f (S.index bs i)
loop (i - 1)
-- | The main entry point.
main :: IO ()
main = do
freq <- newFreq
runResourceT
$ sourceFile "random"
$$ CL.mapM_ (addFreqBS freq)
printFreq freq
I ran this on 500MB of random data and compared with @josejuan's UArray-based answer:
- conduit based/mutable vectors: 1.006s
- UArray: 17.962s
I think it should be possible to keep much of the elegance of josejuan's high-level approach yet keep the speed of the mutable vector implementation, but I haven't had a chance to try implementing something like that yet. Also, note that with some general purpose helper functions (like Data.ByteString.mapM or Data.Conduit.Binary.mapM) the implementation could be significantly simpler without affecting performance.
You can play with this implementation on FP Haskell Center as well.
EDIT: I added one of those missing functions to conduit
and cleaned up the code a bit; it now looks like the following:
import Control.Monad.Trans.Class (lift)
import Data.ByteString (ByteString)
import Data.Conduit (Consumer, ($$))
import qualified Data.Conduit.Binary as CB
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as VM
import System.IO (stdin)
freqSink :: Consumer ByteString IO (V.Vector Int)
freqSink = do
freq <- lift $ VM.replicate 256 0
CB.mapM_ $ \w -> do
let index = fromIntegral w
oldCount <- VM.read freq index
VM.write freq index (oldCount + 1)
lift $ V.freeze freq
main :: IO ()
main = (CB.sourceHandle stdin $$ freqSink) >>= print
The only difference in functionality is how the frequency is printed.
这篇关于字符的频率的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!