从文件中优化Haskell数据读取 [英] Optimising Haskell data reading from file
问题描述
我试图在 3.5米线上实施科萨拉朱的图表算法文件,其中每行是两个(空格分隔的)表示图边的Ints。要开始,我需要创建一个摘要数据结构,其中包含节点和其传入和传出边缘的列表。下面的代码实现了这一点,但需要一分钟,而我可以从MOOC论坛上的帖子看到使用其他语言的人正在<10s内完成。 ( getLines
取10s,而我读到的基准测试中的取值小于1)。我是Haskell的新手并且使用 foldl'
('
是使其终止的突破)实现了一种累积方法,但是这在风格上感觉相当迫切,我希望这就是运行缓慢的原因。此外,我目前正计划使用类似的模式来进行深度优先搜索,我担心它会变得太慢。
我找到了此介绍和博客,这些话题讨论了这些问题,但在专家层面上太过专业。 / p>
导入System.IO
$ p $使用地图:
导入Control.Monad
导入Data.Map.Strict作为地图
将Data.List导入为L
类型NodeName = Int
类型Edges = [NodeName]
类型Explored = Bool
data Node = Node Explored (Edges,Edges)派生(显示)
类型Graph1 = Map NodeName节点
getLines :: FilePath - > IO [[Int]]
getLines = liftM(fmap(fmap read。words)。lines)。 readFile
getLines':: FilePath - > IO [(Int,Int)]
getLines'= liftM(fmap(tuplify2。fmap read。words)。lines)。 readFile
tuplify2 :: [a] - > (a,a)
tuplify2 [x,y] =(x,y)
main = do
list< - getLinestestdata.txt - [String ]
--list< - getLinesSCC.txt - [String]
let
list'= createGraph list
返回列表'
createGraph :: [[Int]] - > Graph1
createGraph xs = L.foldl'build Map.empty xs
其中
build :: Graph1-> [Int] - > Graph1
build = \acc(x:y:_) - >
let tmpAcc = case Map.lookup x acc
Nothing - > Map.insert x(Node False([y],[]))acc
只是 - > Map.adjust(\(Node _(fwd,bck)) - >(Node False((y:fwd),bck)))x acc
(如果Map.lookup y tmpAcc为
没有 - > Map.insert y(Node False([],[x]))tmpAcc
只是 - > Map.adjust(\(Node _(fwd,bck)) - >(Node False(fwd,(x:bck))))y tmpAcc
- 使用
解决方案code> IntMap 或
HashMap
。Int
键比Map
快得多。HashMap
通常比IntMap
快,但使用更多内存并且库的内容较少。 - 不要做不必要的查找。
容器
包有大量的专用函数。使用alter
与问题中的createGraph
实现相比,查找次数可以减半。
createGraph
示例:
import Data.List(foldl')
导入限定的Data.IntMap.Strict作为IM
类型NodeName = Int
类型Edges = [NodeName]
类型Explored = Bool
data Node = Node Explored Edges导出(Eq,Show)的边缘
类型Graph1 = IM.IntMap节点
createGraph :: [(Int,Int)] - > Graph1
createGraph xs = foldl'build IM.empty xs
where
addFwd y(Just(Node _ fb))= Just(Node False(y:f)b)
addFwd y _ = Just(Node False [y] [])
addBwd x(Just(Node _fb))= Just(Node False f(x:b))
addBwd x _ = Just节点False [] [x])
build :: Graph1 - > (Int,Int) - > Graph1
build acc(x,y)= IM.alter(addBwd x)y $ IM.alter(addFwd y)x acc
使用载体:
- 考虑高效的构造函数(累加器,展开式,
generate
,iterate
,constructN
等)。这些可能会在幕后使用突变,但比实际的可变向量使用起来要方便得多。在更一般的情况下,使用盒装向量的懒惰来构建自定义向量。 - 尽可能使用非盒装向量。
- 当您完全确定界限时使用不安全的函数。
- 只有在没有纯选项时才使用可变向量。在那种情况下,更喜欢ST monad到IO。另外,请避免创建许多可变堆对象(即优先选择可变引物到不可变引用的不可变引物)。
createGraph
示例:
将限定的Data.Vector导入为V
类型NodeName = Int
类型Edges = [NodeName]
Explored = Bool
data Node = Node Explored Edges导出(Eq,Show)的边缘
类型Graph1 = V.Vector节点
createGraph :: Int - > [(Int,Int)] - > Graph1
createGraph maxIndex edges = graph''其中
graph = V.replicate maxIndex(Node False [] [])
graph'= V.accum(\(Node efb)x - >节点e(x:f)b)图形边缘
图形'= V.accum(\(Node efb)x→Node ef(x:b))graph'(map(\\ (a,b) - >(b,a))边缘)
在节点索引范围内的差距,那么明智的做法是在做任何其他事情之前连续重新标记索引。
Node
引入一个空的构造函数来表示缺少的索引。
更快的I / O:
- 使用来自
Data.Text
或Data.ByteString
的IO函数。在这两种情况下,还有用于将输入分为行或单词的有效功能。
示例:
import合格的Data.ByteString.Char8作为BS
import System.IO
getLines :: FilePath - > IO [(Int,Int)]
getLines path = do
lines < - (map BS.words。BS.lines)`fmap` BS.readFile path
let pairs =(map (可能(错误无法读取Int)fst。BS.readInt)行
return [(a,b)| [a,b] < - 对]
标杆:
总是这样做,不像我在这个答案。使用 条件
。
I am trying to implement Kosaraju's graph algorithm, on a 3.5m line file where each row is two (space separated) Ints representing a graph edge. To start I need to create a summary data structure that has the node and lists of its incoming and outgoing edges. The code below achieves that, but takes over a minute, whereas I can see from posts on the MOOC forum that people using other languages are completing in <<10s. (getLines
is taking 10s compared to under 1s in benchmarks I read about.)
I'm new to Haskell and have implemented an accumulation method using foldl'
(the '
was a breakthrough in making it terminate at all), but it feels rather imperative in style, and I'm hoping that that's the reason why it is running slow. Moreover, I'm currently planning to use a similar pattern to conduct the depth-first-search, and I fear it will all just become too slow.
I have found this presentation and blog that talk about these sort of issues but at too expert a level.
import System.IO
import Control.Monad
import Data.Map.Strict as Map
import Data.List as L
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored (Edges, Edges) deriving (Show)
type Graph1 = Map NodeName Node
getLines :: FilePath -> IO [[Int]]
getLines = liftM (fmap (fmap read . words) . lines) . readFile
getLines' :: FilePath -> IO [(Int,Int)]
getLines' = liftM (fmap (tuplify2 . fmap read . words) . lines) . readFile
tuplify2 :: [a] -> (a,a)
tuplify2 [x,y] = (x,y)
main = do
list <- getLines "testdata.txt" -- [String]
--list <- getLines "SCC.txt" -- [String]
let
list' = createGraph list
return list'
createGraph :: [[Int]] -> Graph1
createGraph xs = L.foldl' build Map.empty xs
where
build :: Graph1-> [Int] -> Graph1
build = \acc (x:y:_) ->
let tmpAcc = case Map.lookup x acc of
Nothing -> Map.insert x (Node False ([y],[])) acc
Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False ((y:fwd), bck))) x acc
in case Map.lookup y tmpAcc of
Nothing -> Map.insert y (Node False ([],[x])) tmpAcc
Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False (fwd, (x:bck)))) y tmpAcc
Using maps:
- Use
IntMap
orHashMap
when possible. Both are significantly faster forInt
keys thanMap
.HashMap
is usually faster thanIntMap
but uses more RAM and has a less rich library. - Don't do unnecessary lookups. The
containers
package has a large number of specialized functions. Withalter
the number of lookups can be halved compared to thecreateGraph
implementation in the question.
Example for createGraph
:
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = IM.IntMap Node
createGraph :: [(Int, Int)] -> Graph1
createGraph xs = foldl' build IM.empty xs
where
addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b)
addFwd y _ = Just (Node False [y] [])
addBwd x (Just (Node _ f b)) = Just (Node False f (x:b))
addBwd x _ = Just (Node False [] [x])
build :: Graph1 -> (Int, Int) -> Graph1
build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc
Using vectors:
- Consider the efficient construction functions (the accumulators, unfolds,
generate
,iterate
,constructN
, etc.). These may use mutation behind the scenes but are considerably more convenient to use than actual mutable vectors. - In the more general case, use the laziness of boxed vectors to enable self-reference when constructing a vector.
- Use unboxed vectors when possible.
- Use unsafe functions when you're absolutely sure about the bounds.
- Only use mutable vectors when there aren't pure alternatives. In that case, prefer the ST monad to IO. Also, avoid creating many mutable heap objects (i. e. prefer mutable vectors to immutable vectors of mutable references).
Example for createGraph
:
import qualified Data.Vector as V
type NodeName = Int
type Edges = [NodeName]
type Explored = Bool
data Node = Node Explored Edges Edges deriving (Eq, Show)
type Graph1 = V.Vector Node
createGraph :: Int -> [(Int, Int)] -> Graph1
createGraph maxIndex edges = graph'' where
graph = V.replicate maxIndex (Node False [] [])
graph' = V.accum (\(Node e f b) x -> Node e (x:f) b) graph edges
graph'' = V.accum (\(Node e f b) x -> Node e f (x:b)) graph' (map (\(a, b) -> (b, a)) edges)
Note that if there are gaps in the range of the node indices, then it'd be wise to either
- Contiguously relabel the indices before doing anything else.
- Introduce an empty constructor to
Node
to signify a missing index.
Faster I/O:
- Use the IO functions from
Data.Text
orData.ByteString
. In both cases there are also efficient functions for breaking input into lines or words.
Example:
import qualified Data.ByteString.Char8 as BS
import System.IO
getLines :: FilePath -> IO [(Int, Int)]
getLines path = do
lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
return [(a, b) | [a, b] <- pairs]
Benchmarking:
Always do it, unlike me in this answer. Use criterion
.
这篇关于从文件中优化Haskell数据读取的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!