字符的频率 [英] Frequency of characters

查看：215 发布时间：2018/6/4 17:36:05 haskell bytestring

本文介绍了字符的频率的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在使用 Haskell 查找文件中字符的频率。我希望能够处理〜500MB大小的文件。

我到现在为止所尝试的内容

它可以完成这项工作，但速度有点慢，因为它将文件解析为256次

  calculateFrequency :: L.ByteString  - > [（Word8，Int64）] 
 calculateFrequency f = foldl（\acc x  - >（x，L.count xf）：acc）[] [255,254 .. 0]

我也尝试过使用Data.Map，但程序内存不足（在ghc解释器中）。导入限定的Data.ByteString.Lazy为L 将限定的Data.Map导入为M calculateFrequency ':: L.ByteString - > [（Word8，Int64）] calculateFrequency'xs = M.toList $ L.foldl'（\ m word - > M.insertWith（+）word 1 m）（M.empty）xs

解决方案

实现使用可变的，未装箱的向量而不是更高层次的构造。它还使用 conduit 来读取文件以避免惰性I / O。

 导入Control.Monad.IO.Class 
将合格的Data.ByteString导入为S 
导入Data.Conduit 
导入Data.Conduit.Binary作为CB 
导入合格的Data.Conduit .List as CL 
将合格Data.Vector.Unboxed.Mutable作为VM 
导入Data.Word（Word8）
 
类型Freq = VM.IOVector Int 
 
 newFreq :: MonadIO m => m Freq 
 newFreq = liftIO $ VM.replicate 256 0 
 
 printFreq :: MonadIO m =>频率 - > m（）
 printFreq freq = 
 liftIO $ mapM_ go [0..255] 
 where 
 go i = do 
x<  -  VM.read freq i 
 putStrLn $ show i ++：++ show x 
 
 addFreqWord8 :: MonadIO m =>频率 - > Word8  - > m（）
 addFreqWord8 fw = liftIO $ do 
 let index = fromIntegral w 
 oldCount<  -  VM.read f index 
 VM.write f index（oldCount + 1）
 
 addFreqBS :: MonadIO m =>频率 - > S.ByteString  - > m（）
 addFreqBS f bs = 
 loop（S.length bs  -  1）
 where 
 loop（-1）= return（）
 loop i = do 
 addFreqWord8 f（S.index bs i）
 loop（i  -  1）
 
  -  |主要的入口点。 
 main :: IO（）
 main = do 
 freq<  -  newFreq 
 runResourceT 
 $ sourceFilerandom
 $$ CL.mapM_（ addFreqBS freq）
 printFreq freq

我在500MB的随机数据上运行这个函数， josejuan的基于UArray的答案：

基于导管的/可变的向量：1.006s UArray：17.962

我认为应该有可能保留josejuan的高级方法的优雅，同时保持可变向量的速度实现，但我还没有机会尝试实现类似的东西呢。另外，请注意，使用一些通用帮助器函数（如Data.ByteString.mapM或Data.Conduit.Binary.mapM），实现可能会显着简化，而不会影响性能。您可以在FP Haskell Center上使用此实现作为编辑：我将其中一个缺失的函数添加到 conduit 并清理稍微提一下代码;它现在看起来像下面这样： import Control.Monad.Trans.Class（lift） import Data.ByteString （ByteString） import Data.Conduit（Consumer，（$$））将合格的Data.Conduit.Binary导入为CB 将合格的Data.Vector.Unboxed导入为V 导入合格的Data.Vector.Unboxed.Mutable为VM import System.IO（stdin） freqSink :: Consumer ByteString IO（V.Vector Int） freqSink = do freq< - lift $ VM.replicate 256 0 CB.mapM_ $ \w - >做 let index = fromIntegral w oldCount< - VM.read freq index VM.write freq index（oldCount + 1） lift $ V.freeze freq main :: IO（） main =（CB.sourceHandle stdin $$ freqSink）>> = print 功能唯一的区别在于频率是如何打印的。 I am trying to find frequency of characters in file using Haskell. I want to be able to handle files ~500MB size. What I've tried till now It does the job but is a bit slow as it parses the file 256 times calculateFrequency :: L.ByteString -> [(Word8, Int64)] calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0] I have also tried using Data.Map but the program runs out of memory (in ghc interpreter). import qualified Data.ByteString.Lazy as L import qualified Data.Map as M calculateFrequency' :: L.ByteString -> [(Word8, Int64)] calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs 解决方案 Here's an implementation using mutable, unboxed vectors instead of higher level constructs. It also uses conduit for reading the file to avoid lazy I/O. import Control.Monad.IO.Class import qualified Data.ByteString as S import Data.Conduit import Data.Conduit.Binary as CB import qualified Data.Conduit.List as CL import qualified Data.Vector.Unboxed.Mutable as VM import Data.Word (Word8) type Freq = VM.IOVector Int newFreq :: MonadIO m => m Freq newFreq = liftIO $ VM.replicate 256 0 printFreq :: MonadIO m => Freq -> m () printFreq freq = liftIO $ mapM_ go [0..255] where go i = do x <- VM.read freq i putStrLn $ show i ++ ": " ++ show x addFreqWord8 :: MonadIO m => Freq -> Word8 -> m () addFreqWord8 f w = liftIO $ do let index = fromIntegral w oldCount <- VM.read f index VM.write f index (oldCount + 1) addFreqBS :: MonadIO m => Freq -> S.ByteString -> m () addFreqBS f bs = loop (S.length bs - 1) where loop (-1) = return () loop i = do addFreqWord8 f (S.index bs i) loop (i - 1) -- | The main entry point. main :: IO () main = do freq <- newFreq runResourceT $ sourceFile "random" $$ CL.mapM_ (addFreqBS freq) printFreq freq I ran this on 500MB of random data and compared with @josejuan's UArray-based answer: conduit based/mutable vectors: 1.006s UArray: 17.962s I think it should be possible to keep much of the elegance of josejuan's high-level approach yet keep the speed of the mutable vector implementation, but I haven't had a chance to try implementing something like that yet. Also, note that with some general purpose helper functions (like Data.ByteString.mapM or Data.Conduit.Binary.mapM) the implementation could be significantly simpler without affecting performance. You can play with this implementation on FP Haskell Center as well. EDIT: I added one of those missing functions to conduit and cleaned up the code a bit; it now looks like the following: import Control.Monad.Trans.Class (lift) import Data.ByteString (ByteString) import Data.Conduit (Consumer, ($$)) import qualified Data.Conduit.Binary as CB import qualified Data.Vector.Unboxed as V import qualified Data.Vector.Unboxed.Mutable as VM import System.IO (stdin) freqSink :: Consumer ByteString IO (V.Vector Int) freqSink = do freq <- lift $ VM.replicate 256 0 CB.mapM_ $ \w -> do let index = fromIntegral w oldCount <- VM.read freq index VM.write freq index (oldCount + 1) lift $ V.freeze freq main :: IO () main = (CB.sourceHandle stdin $$ freqSink) >>= print The only difference in functionality is how the frequency is printed. 这篇关于字符的频率的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

字符的频率 [英] Frequency of characters

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

字符的频率 [英] Frequency of characters

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭