大型向量优化操纵 [英] Optimising manipulation of large vectors

查看:119
本文介绍了大型向量优化操纵的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是一个跟进我的<一个href="http://stackoverflow.com/questions/24278006/need-advice-on-optimising-haskell-data-processing">$p$pvious问题如何处理5.1米边缘的载体重新presentation向图。我想实现Kosaraju的图形算法,因此需要重新安排我的上边缘的深度优先搜索(DFS)的结束时间的顺序颠倒载体。我有一个运行在小数据集但未能在10分钟内返回的整个数据集code。 (我不能排除一个循环产生于大图,但也有是在我的测试数据的迹象。)

DFS需要避免重访的节点,所以我需要某种形式的国家的搜索(目前一个元组,我应该用一个国家单子?的)。第一个搜索应该返回重新排序的载体,但我保持东西present简单通过返回重新排序节点索引列表,这样我就可以处理向量在一个后来去

我presume在 dfsInner 的问题所在。下面的code'记住'节点访问更新每个节点(第三控卫)的探索领域。虽然我试图使它尾递归中,code似乎较快增长的内存使用。 我是否需要实施一些严格的,如果是这样,怎么样? (我有我的一个单一的搜索,它通过查看堆栈和节点已完成了名单上的未知边缘的开始节点检查previous访问使用另一种版本。这不会长得这么很快,但不返回任何井连接的节点。)

不过,这也可能是 foldr相似,而我怎么能检测的?

这是假想Coursera的功课,但我不再相信我可以勾选荣幸code按钮!学习是更重要的,所以我真的不希望复制/粘贴的答案。我拥有的是不是很优雅 - 它有一个迫切的感觉也一样,这是与保持某种状态的驱动问题 - 见第三控卫。我欢迎评论设计模式。

 键入节点名= INT
形边= [节点名]
探索型=布尔
键入堆栈= [(智力,智力)]

数据节点=节点节点名边探索边导出(均衡器,显示)
类型图=矢量节点

主要=做
    边&LT;  -  V.fromList`fmap` getEdgesSCC.txt
    让
        maxIndex = FST $ V.last边缘
        GR = createGraph maxIndex边缘
        RES = dfsOuter克
    --return克
    putStrLn $展水库

dfsOuter克=
    让TMP = V.foldr'callInner(GR,[])克
    在SND TMP

callInner ::节点 - &GT; (图,堆栈) - &GT; (图,堆栈)
callInner(节点IDX _ FWD BWD)(GR,ACC)=
    让(节点探索_ _ _)=克五! IDX
    如果探讨了
        真 - &GT; (GR,ACC)
        假 - &GT;
            让
                initialStack =地图(\ L  - &GT;(IDX,1))BWD
                GR = GR V // [(IDX,节点IDX真FWD BWD)
                (GR',newScc)= dfsInner IDX initialStack(长ACC)(GR',[])
            在(克',newScc ++ ACC)

dfsInner ::节点名 - &GT;堆栈 - &GT; INT  - &GT; (图,[(智力,智力)]) - &GT; (图,[(智力,智力)])
dfsInner启动[] finishCounter(GR,ACC)=(GR,(开始,finishCounter):ACC)
dfsInner启动堆栈finishCounter(GR,ACC)
    | nextStart / =启动=  - 没有更多的地方,从这个节点去
        dfsInner nextStart栈(finishCounter + 1)$(GR,(开始,finishCounter):ACC)
    | nextExplored =
 -  nextExplored ||任何(\(Y,_) - &GT; Y == stack0Head)堆栈||任何(\(X,_) -  GT,X == stack0Head)ACC =
        dfsInner启动(尾堆)finishCounter(GR,ACC)
    |否则=
        dfsInner nextEnd(add2Stack ++堆栈)finishCounter(GR V // [(nextEnd,节点IDX真nextLHS nextRHS),ACC)
 -  dfsInner克stack0Head(add2Stack ++堆栈)finishCounter ACC

    哪里
        (nextStart,nextEnd)=头堆
        (节点IDX nextExplored nextLHS nextRHS)=克五! nextEnd
        add2Stack =地图(\ L  - &GT;(nextEnd,1))nextRHS
 

解决方案

根据@andras要点,我改写了我的code如下。我没有用箭头的功能,因为我不熟悉他们,我的第二个深度优先搜索文体上是一样的第一个(而不是@Andras filterM方法)。最终的结果是,它的安德拉什'code(-21的替代114S)20%的时间完成。

 进口资质Data.Vector为V
进口资质Data.IntSet原样
进口资质Data.ByteString.Char8为BS
进口Data.List模块
进口Control.Monad
进口Control.Monad.State
--import Criterion.Main

--getEdges ::串 - &GT; IO [(智力,智力)]
getEdges文件=做
    线&LT;  - (图BS.words BS.lines)`fmap` BS.readFile文件
    让
        对=(地图。地图)(也许(错误无法读取内部)FST。BS.readInt)线
        对'= [(A,B)| [A,B]其中, - 对 - 增加了9秒
        maxIndex = FST $最后对'
        图= createGraph maxIndex对'
    回报图

主要=做
    图&LT;  -  getEdgesSCC.txt
     - 让
        --maxIndex = FST $ V.last边缘
    让
        FTS = bwdLoop图
        领导= FST $ execState(fwdLoop图FTS)([],IS.empty)
    打印$长度领袖

型连接= [INT]
数据节点=节点{FWD,BWD ::连接}导出(显示)
类型图= V.Vector节点

访问类型= IS.IntSet
键入FinishTime = INT
类型FinishTimes = [FinishTime]
型领导= [INT]

createGraph ::诠释 - &GT; [(智力,智力)]  - &GT;图形
createGraph maxIndex对=
    让
        图表= V.replicate(maxIndex + 1)(节点[] [])
        图表'= V.accum(\(节点FB)× - &gt;节点(X:F)B)曲线图对
    在V.accum(\(节点FB)× - &gt;节点F(X:b))的曲线图'$地图(\(A,B) - &GT;(B,A))对

bwdLoop ::图 - &GT; FinishTimes
bwdLoop G = FST $ execState(mapM_去$反向[0 .. V.length克 -  1])([],IS.empty),其中
    去::诠释 - &GT;国家(FinishTimes,访问)()
    去我=做
        (fTimes,VS)&LT;  - 获取
        让参观= IS.member我VS
        如果没有参观过那么做
            把(fTimes,IS.insert我VS)
            mapM_去$ BWD $ G五!一世
             - 从mapM_修改后,再次获得国家
            (fTimes',VS')&LT;  - 获取
            把(我:fTimes,VS')
        否则返回()

fwdLoop ::图 - &GT; FinishTimes  - &GT;国家(领导人,访问)()
fwdLoop _ [] =收益率()
fwdLoop G(I:FTS)= DO
    (LS,VS)&LT;  - 获取
    让参观= IS.member我VS
    如果没有参观过那么做
        把(我:LS,IS.insert我VS)
        mapM_去$ FWD $ G五!一世
    否则返回()
    fwdLoop克FTS

    哪里
        去::诠释 - &GT;国家(领导人,访问)()
        去我=做
            (LS,VS)&LT;  - 获取
            让参观= IS.member我VS
            如果没有参观过那么做
                把(LS,IS.insert我VS)
                mapM_去$ FWD $ G五!一世
            否则返回()
 

This is a follow up to my previous question about processing a Vector representation of a 5.1m edge directed graph. I am trying to implement Kosaraju's graph algorithm and thus need to rearrange my Vector in the order of the finishing times of a depth first search (DFS) on the edges reversed. I have code that runs on small data sets but that fails to return in 10 minutes on the full data set. (I can't exclude that a loop arises from the big graph, but there are no signs of that on my test data.)

DFS needs to avoid revisiting nodes, so I need some sort of 'state' for the search (currently a tuple, should I use a State Monad?). The first search should return a reordered Vector, but I am keeping things simple at present by returning a list of the reordered Node indexes so that I can process the Vector in one go subsequently.

I presume the issue lies in dfsInner. The code below 'remembers' the nodes visited updating the explored field of each node (third guard). Although I tried to make it tail recursive, the code seems to grow memory use fairly fast. Do I need to enforce some strictness and if so, how? (I have another version that I use on a single search search, which checks for previous visits by looking at the start nodes of the unexplored edges on the stack and the list of nodes that have been completed. This does not grow so quickly, but does not return for any well connected node.)

However, it could also be the foldr', but how can I detect that?

This is supposedly Coursera homework, but I'm no longer sure I can tick the honour code button! Learning is more important though, so I don't really want a copy/paste answer. What I have is not very elegant - it has an imperative feel to it too, which is driven by the issue with keeping some sort of state - see third guard. I'd welcome comments on design patterns.

type NodeName = Int
type Edges    = [NodeName]
type Explored = Bool
type Stack    = [(Int, Int)]

data Node  = Node NodeName Explored Edges Edges deriving (Eq, Show)
type Graph = Vector Node

main = do
    edges <- V.fromList `fmap` getEdges "SCC.txt"
    let 
        maxIndex = fst $ V.last edges
        gr = createGraph maxIndex edges
        res = dfsOuter gr
    --return gr
    putStrLn $ show res

dfsOuter gr = 
    let tmp = V.foldr' callInner (gr,[]) gr
    in snd tmp

callInner :: Node -> (Graph, Stack) -> (Graph, Stack)
callInner (Node idx _ fwd bwd) (gr,acc) = 
    let (Node _ explored _ _) = gr V.! idx
    in case explored of
        True  -> (gr, acc)
        False ->
            let
                initialStack = map (\l -> (idx, l)) bwd
                gr' = gr V.// [(idx, Node idx True fwd bwd)]
                (gr'', newScc) = dfsInner idx initialStack (length acc) (gr', [])
            in (gr'', newScc++acc)

dfsInner :: NodeName -> Stack -> Int -> (Graph, [(Int, Int)]) -> (Graph, [(Int, Int)])
dfsInner start [] finishCounter (gr, acc) = (gr, (start, finishCounter):acc)
dfsInner start stack finishCounter (gr, acc)
    | nextStart /= start =                      -- no more places to go from this node
        dfsInner nextStart stack (finishCounter + 1) $ (gr, (start, finishCounter):acc)
    | nextExplored = 
-- nextExplored || any (\(y,_) -> y == stack0Head) stack || any (\(x,_) -> x == stack0Head) acc =
        dfsInner start (tail stack) finishCounter (gr, acc)
    | otherwise =
        dfsInner nextEnd (add2Stack++stack) finishCounter (gr V.// [(nextEnd, Node idx True nextLHS nextRHS)], acc)
--      dfsInner gr stack0Head (add2Stack++stack) finishCounter acc

    where
        (nextStart, nextEnd) = head stack
        (Node idx nextExplored nextLHS nextRHS) = gr V.! nextEnd
        add2Stack = map (\l -> (nextEnd, l)) nextRHS

解决方案

Based on @andras gist, I rewrote my code as below. I did not use Arrow functions as I am unfamiliar with them, and my second depth first search is stylistically the same as the first one (instead of @Andras filterM approach). The end result is that it completes in 20% of the time of Andras' code (21s instead of 114s).

import qualified Data.Vector as V
import qualified Data.IntSet as IS
import qualified Data.ByteString.Char8 as BS
import Data.List
import Control.Monad
import Control.Monad.State
--import Criterion.Main

--getEdges :: String -> IO [(Int, Int)]
getEdges file = do
    lines <- (map BS.words . BS.lines) `fmap` BS.readFile file
    let 
        pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
        pairs' = [(a, b) | [a, b] <- pairs]         -- adds 9 seconds
        maxIndex = fst $ last pairs'
        graph = createGraph maxIndex pairs'
    return graph

main = do
    graph <- getEdges "SCC.txt"
    --let 
        --maxIndex = fst $ V.last edges
    let 
        fts = bwdLoop graph
        leaders = fst $ execState (fwdLoop graph fts) ([], IS.empty)
    print $ length leaders

type Connections = [Int]
data Node = Node {fwd, bwd :: Connections} deriving (Show)
type Graph = V.Vector Node

type Visited = IS.IntSet
type FinishTime = Int
type FinishTimes = [FinishTime]
type Leaders = [Int]

createGraph :: Int -> [(Int, Int)] -> Graph
createGraph maxIndex pairs = 
    let
        graph  = V.replicate (maxIndex+1) (Node [] [])
        graph' = V.accum (\(Node f b) x -> Node (x:f) b) graph  pairs
    in           V.accum (\(Node f b) x -> Node f (x:b)) graph' $ map (\(a,b) -> (b,a)) pairs

bwdLoop :: Graph -> FinishTimes
bwdLoop g = fst $ execState (mapM_ go $ reverse [0 .. V.length g - 1]) ([], IS.empty) where
    go :: Int -> State (FinishTimes, Visited) ()
    go i = do
        (fTimes, vs) <- get
        let visited = IS.member i vs
        if not visited then do
            put (fTimes, IS.insert i vs)
            mapM_ go $ bwd $ g V.! i
            -- get state again after changes from mapM_
            (fTimes', vs') <- get
            put (i : fTimes', vs')
        else return ()

fwdLoop :: Graph -> FinishTimes -> State (Leaders, Visited) ()
fwdLoop _ [] = return ()
fwdLoop g (i:fts) = do
    (ls, vs) <- get
    let visited = IS.member i vs
    if not visited then do
        put (i:ls, IS.insert i vs)
        mapM_ go $ fwd $ g V.! i
    else return ()
    fwdLoop g fts

    where
        go :: Int -> State (Leaders, Visited) ()
        go i = do
            (ls, vs) <- get
            let visited = IS.member i vs
            if not visited then do
                put (ls, IS.insert i vs)
                mapM_ go $ fwd $ g V.! i
            else return ()

这篇关于大型向量优化操纵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆