从文件中优化Haskell数据读取 [英] Optimising Haskell data reading from file

查看:73
本文介绍了从文件中优化Haskell数据读取的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图在 3.5米线上实施科萨拉朱的图表算法文件,其中每行是两个(空格分隔的)表示图边的Ints。要开始,我需要创建一个摘要数据结构,其中包含节点和其传入和传出边缘的列表。下面的代码实现了这一点,但需要一分钟,而我可以从MOOC论坛上的帖子看到使用其他语言的人正在<10s内完成。 ( getLines 取10s,而我读到的基准测试中的取值小于1)。我是Haskell的新手并且使用 foldl''是使其终止的突破)实现了一种累积方法,但是这在风格上感觉相当迫切,我希望这就是运行缓慢的原因。此外,我目前正计划使用类似的模式来进行深度优先搜索,我担心它会变得太慢。



我找到了此介绍博客,这些话题讨论了这些问题,但在专家层面上太过专业。 / p>

 导入System.IO 
导入Control.Monad
导入Data.Map.Strict作为地图
将Data.List导入为L

类型NodeName = Int
类型Edges = [NodeName]
类型Explored = Bool

data Node = Node Explored (Edges,Edges)派生(显示)

类型Graph1 = Map NodeName节点

getLines :: FilePath - > IO [[Int]]
getLines = liftM(fmap(fmap read。words)。lines)。 readFile

getLines':: FilePath - > IO [(Int,Int)]
getLines'= liftM(fmap(tuplify2。fmap read。words)。lines)。 readFile

tuplify2 :: [a] - > (a,a)
tuplify2 [x,y] =(x,y)

main = do
list< - getLinestestdata.txt - [String ]
--list< - getLinesSCC.txt - [String]
let
list'= createGraph list
返回列表'

createGraph :: [[Int]] - > Graph1
createGraph xs = L.foldl'build Map.empty xs
其中
build :: Graph1-> [Int] - > Graph1
build = \acc(x:y:_) - >
let tmpAcc = case Map.lookup x acc
Nothing - > Map.insert x(Node False([y],[]))acc
只是 - > Map.adjust(\(Node _(fwd,bck)) - >(Node False((y:fwd),bck)))x acc
(如果Map.lookup y tmpAcc为
没有 - > Map.insert y(Node False([],[x]))tmpAcc
只是 - > Map.adjust(\(Node _(fwd,bck)) - >(Node False(fwd,(x:bck))))y tmpAcc



  • 使用

    解决方案

    code> IntMap 或 HashMap Int 键比 Map 快得多。 HashMap 通常比 IntMap 快,但使用更多内存并且库的内容较少。


  • 不要做不必要的查找。 容器包有大量的专用函数。使用 alter 与问题中的 createGraph 实现相比,查找次数可以减半。



createGraph 示例:

  import Data.List(foldl')
导入限定的Data.IntMap.Strict作为IM

类型NodeName = Int
类型Edges = [NodeName]
类型Explored = Bool

data Node = Node Explored Edges导出(Eq,Show)的边缘
类型Graph1 = IM.IntMap节点

createGraph :: [(Int,Int)] - > Graph1
createGraph xs = foldl'build IM.empty xs
where
addFwd y(Just(Node _ fb))= Just(Node False(y:f)b)
addFwd y _ = Just(Node False [y] [])
addBwd x(Just(Node _fb))= Just(Node False f(x:b))
addBwd x _ = Just节点False [] [x])

build :: Graph1 - > (Int,Int) - > Graph1
build acc(x,y)= IM.alter(addBwd x)y $ IM.alter(addFwd y)x acc



使用载体:




  • 考虑高效的构造函数(累加器,展开式, generate iterate constructN 等)。这些可能会在幕后使用突变,但比实际的可变向量使用起来要方便得多。在更一般的情况下,使用盒装向量的懒惰来构建自定义向量。

  • 尽可能使用非盒装向量。

  • 当您完全确定界限时使用不安全的函数。

  • 只有在没有纯选项时才使用可变向量。在那种情况下,更喜欢ST monad到IO。另外,请避免创建许多可变堆对象(即优先选择可变引物到不可变引用的不可变引物)。



createGraph 示例:

 将限定的Data.Vector导入为V 

类型NodeName = Int
类型Edges = [NodeName]
Explored = Bool

data Node = Node Explored Edges导出(Eq,Show)的边缘
类型Graph1 = V.Vector节点

createGraph :: Int - > [(Int,Int)] - > Graph1
createGraph maxIndex edges = graph''其中
graph = V.replicate maxIndex(Node False [] [])
graph'= V.accum(\(Node efb)x - >节点e(x:f)b)图形边缘
图形'= V.accum(\(Node efb)x→Node ef(x:b))graph'(map(\\ (a,b) - >(b,a))边缘)

在节点索引范围内的差距,那么明智的做法是在做任何其他事情之前连续重新标记索引。

  • Node 引入一个空的构造函数来表示缺少的索引。



  • 更快的I / O:




    • 使用来自 Data.Text Data.ByteString 的I​​O函数。在这两种情况下,还有用于将输入分为行或单词的有效功能。



    示例:

      import合格的Data.ByteString.Char8作为BS 
    import System.IO

    getLines :: FilePath - > IO [(Int,Int)]
    getLines path = do
    lines < - (map BS.words。BS.lines)`fmap` BS.readFile path
    let pairs =(map (可能(错误无法读取Int)fst。BS.readInt)行
    return [(a,b)| [a,b] < - 对]



    标杆:



    总是这样做,不像我在这个答案。使用 条件

    I am trying to implement Kosaraju's graph algorithm, on a 3.5m line file where each row is two (space separated) Ints representing a graph edge. To start I need to create a summary data structure that has the node and lists of its incoming and outgoing edges. The code below achieves that, but takes over a minute, whereas I can see from posts on the MOOC forum that people using other languages are completing in <<10s. (getLines is taking 10s compared to under 1s in benchmarks I read about.)

    I'm new to Haskell and have implemented an accumulation method using foldl' (the ' was a breakthrough in making it terminate at all), but it feels rather imperative in style, and I'm hoping that that's the reason why it is running slow. Moreover, I'm currently planning to use a similar pattern to conduct the depth-first-search, and I fear it will all just become too slow.

    I have found this presentation and blog that talk about these sort of issues but at too expert a level.

    import System.IO
    import Control.Monad
    import Data.Map.Strict as Map
    import Data.List as L
    
    type NodeName = Int
    type Edges = [NodeName]
    type Explored = Bool
    
    data Node = Node Explored (Edges, Edges) deriving (Show)
    
    type Graph1 = Map NodeName Node
    
    getLines :: FilePath -> IO [[Int]]
    getLines = liftM (fmap (fmap read . words) . lines) . readFile
    
    getLines' :: FilePath -> IO [(Int,Int)]
    getLines' = liftM (fmap (tuplify2 . fmap read . words) . lines) . readFile
    
    tuplify2 :: [a] -> (a,a)
    tuplify2 [x,y] = (x,y)
    
    main = do
        list <- getLines "testdata.txt"  -- [String]
        --list <- getLines "SCC.txt"  -- [String]   
        let
            list' = createGraph list
        return list'
    
    createGraph :: [[Int]] -> Graph1
    createGraph xs = L.foldl' build Map.empty xs
        where
            build :: Graph1-> [Int] -> Graph1
            build = \acc (x:y:_) ->
                let tmpAcc = case Map.lookup x acc of
                    Nothing -> Map.insert x (Node False ([y],[])) acc
                    Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False ((y:fwd), bck))) x acc
                in case Map.lookup y tmpAcc of
                    Nothing -> Map.insert y (Node False ([],[x])) tmpAcc
                    Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False (fwd, (x:bck)))) y tmpAcc
    

    解决方案

    Using maps:

    • Use IntMap or HashMap when possible. Both are significantly faster for Int keys than Map. HashMap is usually faster than IntMap but uses more RAM and has a less rich library.
    • Don't do unnecessary lookups. The containers package has a large number of specialized functions. With alter the number of lookups can be halved compared to the createGraph implementation in the question.

    Example for createGraph:

    import Data.List (foldl')
    import qualified Data.IntMap.Strict as IM
    
    type NodeName = Int
    type Edges = [NodeName]
    type Explored = Bool
    
    data Node = Node Explored Edges Edges deriving (Eq, Show)
    type Graph1 = IM.IntMap Node
    
    createGraph :: [(Int, Int)] -> Graph1
    createGraph xs = foldl' build IM.empty xs
        where
            addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
            addFwd y _                   = Just (Node False [y] [])
            addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
            addBwd x _                   = Just (Node False [] [x])
    
            build :: Graph1 -> (Int, Int) -> Graph1
            build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc 
    

    Using vectors:

    • Consider the efficient construction functions (the accumulators, unfolds, generate, iterate, constructN, etc.). These may use mutation behind the scenes but are considerably more convenient to use than actual mutable vectors.
    • In the more general case, use the laziness of boxed vectors to enable self-reference when constructing a vector.
    • Use unboxed vectors when possible.
    • Use unsafe functions when you're absolutely sure about the bounds.
    • Only use mutable vectors when there aren't pure alternatives. In that case, prefer the ST monad to IO. Also, avoid creating many mutable heap objects (i. e. prefer mutable vectors to immutable vectors of mutable references).

    Example for createGraph:

    import qualified Data.Vector as V
    
    type NodeName = Int
    type Edges = [NodeName]
    type Explored = Bool
    
    data Node = Node Explored Edges Edges deriving (Eq, Show)
    type Graph1 = V.Vector Node
    
    createGraph :: Int -> [(Int, Int)] -> Graph1
    createGraph maxIndex edges = graph'' where
        graph    = V.replicate maxIndex (Node False [] [])
        graph'   = V.accum (\(Node e f b) x -> Node e (x:f) b) graph  edges
        graph''  = V.accum (\(Node e f b) x -> Node e f (x:b)) graph' (map (\(a, b) -> (b, a)) edges)
    

    Note that if there are gaps in the range of the node indices, then it'd be wise to either

    1. Contiguously relabel the indices before doing anything else.
    2. Introduce an empty constructor to Node to signify a missing index.

    Faster I/O:

    • Use the IO functions from Data.Text or Data.ByteString. In both cases there are also efficient functions for breaking input into lines or words.

    Example:

    import qualified Data.ByteString.Char8 as BS
    import System.IO
    
    getLines :: FilePath -> IO [(Int, Int)]
    getLines path = do
        lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
        let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
        return [(a, b) | [a, b] <- pairs]
    

    Benchmarking:

    Always do it, unlike me in this answer. Use criterion.

    这篇关于从文件中优化Haskell数据读取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆