字符的频率 [英] Frequency of characters

查看:215
本文介绍了字符的频率的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 Haskell 查找文件中字符的频率。我希望能够处理〜500MB大小的文件。



我到现在为止所尝试的内容


  1. 它可以完成这项工作,但速度有点慢,因为它将文件解析为256次

      calculateFrequency :: L.ByteString  - > [(Word8,Int64)] 
    calculateFrequency f = foldl(\acc x - >(x,L.count xf):acc)[] [255,254 .. 0]


  2. 我也尝试过使用Data.Map,但程序内存不足(在ghc解释器中)。

     导入限定的Data.ByteString.Lazy为L 
    将限定的Data.Map导入为M

    calculateFrequency ':: L.ByteString - > [(Word8,Int64)]
    calculateFrequency'xs = M.toList $ L.foldl'(\ m word - > M.insertWith(+)word 1 m)(M.empty)xs



解决方案

实现使用可变的,未装箱的向量而不是更高层次的构造。它还使用 conduit 来读取文件以避免惰性I / O。

 导入Control.Monad.IO.Class 
将合格的Data.ByteString导入为S
导入Data.Conduit
导入Data.Conduit.Binary作为CB
导入合格的Data.Conduit .List as CL
将合格Data.Vector.Unboxed.Mutable作为VM
导入Data.Word(Word8)

类型Freq = VM.IOVector Int

newFreq :: MonadIO m => m Freq
newFreq = liftIO $ VM.replicate 256 0

printFreq :: MonadIO m =>频率 - > m()
printFreq freq =
liftIO $ mapM_ go [0..255]
where
go i = do
x< - VM.read freq i
putStrLn $ show i ++:++ show x

addFreqWord8 :: MonadIO m =>频率 - > Word8 - > m()
addFreqWord8 fw = liftIO $ do
let index = fromIntegral w
oldCount< - VM.read f index
VM.write f index(oldCount + 1)

addFreqBS :: MonadIO m =>频率 - > S.ByteString - > m()
addFreqBS f bs =
loop(S.length bs - 1)
where
loop(-1)= return()
loop i = do
addFreqWord8 f(S.index bs i)
loop(i - 1)

- |主要的入口点。
main :: IO()
main = do
freq< - newFreq
runResourceT
$ sourceFilerandom
$$ CL.mapM_( addFreqBS freq)
printFreq freq

我在500MB的随机数据上运行这个函数, josejuan的基于UArray的答案:


  • 基于导管的/可变的向量:1.006s
  • UArray:17.962



我认为应该有可能保留josejuan的高级方法的优雅,同时保持可变向量的速度实现,但我还没有机会尝试实现类似的东西呢。另外,请注意,使用一些通用帮助器函数(如Data.ByteString.mapM或Data.Conduit.Binary.mapM),实现可能会显着简化,而不会影响性能。



您可以在FP Haskell Center上使用此实现作为

编辑:我将其中一个缺失的函数添加到 conduit 并清理稍微提一下代码;它现在看起来像下面这样:

  import Control.Monad.Trans.Class(lift)
import Data.ByteString (ByteString)
import Data.Conduit(Consumer,($$))
将合格的Data.Conduit.Binary导入为CB
将合格的Data.Vector.Unboxed导入为V
导入合格的Data.Vector.Unboxed.Mutable为VM
import System.IO(stdin)

freqSink :: Consumer ByteString IO(V.Vector Int)
freqSink = do
freq< - lift $ VM.replicate 256 0
CB.mapM_ $ \w - >做
let index = fromIntegral w
oldCount< - VM.read freq index
VM.write freq index(oldCount + 1)
lift $ V.freeze freq

main :: IO()
main =(CB.sourceHandle stdin $$ freqSink)>> = print

功能唯一的区别在于频率是如何打印的。

I am trying to find frequency of characters in file using Haskell. I want to be able to handle files ~500MB size.

What I've tried till now

  1. It does the job but is a bit slow as it parses the file 256 times

    calculateFrequency :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency f = foldl (\acc x -> (x, L.count x f):acc) [] [255, 254.. 0]
    

  2. I have also tried using Data.Map but the program runs out of memory (in ghc interpreter).

    import qualified Data.ByteString.Lazy as L
    import qualified Data.Map as M
    
    calculateFrequency' :: L.ByteString -> [(Word8, Int64)]
    calculateFrequency' xs = M.toList $ L.foldl' (\m word -> M.insertWith (+) word 1 m) (M.empty) xs
    

解决方案

Here's an implementation using mutable, unboxed vectors instead of higher level constructs. It also uses conduit for reading the file to avoid lazy I/O.

import           Control.Monad.IO.Class
import qualified Data.ByteString             as S
import           Data.Conduit
import           Data.Conduit.Binary         as CB
import qualified Data.Conduit.List           as CL
import qualified Data.Vector.Unboxed.Mutable as VM
import           Data.Word                   (Word8)

type Freq = VM.IOVector Int

newFreq :: MonadIO m => m Freq
newFreq = liftIO $ VM.replicate 256 0

printFreq :: MonadIO m => Freq -> m ()
printFreq freq =
    liftIO $ mapM_ go [0..255]
  where
    go i = do
        x <- VM.read freq i
        putStrLn $ show i ++ ": " ++ show x

addFreqWord8 :: MonadIO m => Freq -> Word8 -> m ()
addFreqWord8 f w = liftIO $ do
    let index = fromIntegral w
    oldCount <- VM.read f index
    VM.write f index (oldCount + 1)

addFreqBS :: MonadIO m => Freq -> S.ByteString -> m ()
addFreqBS f bs =
    loop (S.length bs - 1)
  where
    loop (-1) = return ()
    loop i = do
        addFreqWord8 f (S.index bs i)
        loop (i - 1)

-- | The main entry point.
main :: IO ()
main = do
    freq <- newFreq
    runResourceT
        $  sourceFile "random"
        $$ CL.mapM_ (addFreqBS freq)
    printFreq freq

I ran this on 500MB of random data and compared with @josejuan's UArray-based answer:

  • conduit based/mutable vectors: 1.006s
  • UArray: 17.962s

I think it should be possible to keep much of the elegance of josejuan's high-level approach yet keep the speed of the mutable vector implementation, but I haven't had a chance to try implementing something like that yet. Also, note that with some general purpose helper functions (like Data.ByteString.mapM or Data.Conduit.Binary.mapM) the implementation could be significantly simpler without affecting performance.

You can play with this implementation on FP Haskell Center as well.

EDIT: I added one of those missing functions to conduit and cleaned up the code a bit; it now looks like the following:

import           Control.Monad.Trans.Class   (lift)
import           Data.ByteString             (ByteString)
import           Data.Conduit                (Consumer, ($$))
import qualified Data.Conduit.Binary         as CB
import qualified Data.Vector.Unboxed         as V
import qualified Data.Vector.Unboxed.Mutable as VM
import           System.IO                   (stdin)

freqSink :: Consumer ByteString IO (V.Vector Int)
freqSink = do
    freq <- lift $ VM.replicate 256 0
    CB.mapM_ $ \w -> do
        let index = fromIntegral w
        oldCount <- VM.read freq index
        VM.write freq index (oldCount + 1)
    lift $ V.freeze freq

main :: IO ()
main = (CB.sourceHandle stdin $$ freqSink) >>= print

The only difference in functionality is how the frequency is printed.

这篇关于字符的频率的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆