从文件中优化Haskell数据读取 [英] Optimising Haskell data reading from file

查看：73 发布时间：2018/6/4 16:04:55 haskell

本文介绍了从文件中优化Haskell数据读取的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我试图在 3.5米线上实施科萨拉朱的图表算法文件，其中每行是两个（空格分隔的）表示图边的Ints。要开始，我需要创建一个摘要数据结构，其中包含节点和其传入和传出边缘的列表。下面的代码实现了这一点，但需要一分钟，而我可以从MOOC论坛上的帖子看到使用其他语言的人正在<10s内完成。（ getLines 取10s，而我读到的基准测试中的取值小于1）。我是Haskell的新手并且使用 foldl'（'是使其终止的突破）实现了一种累积方法，但是这在风格上感觉相当迫切，我希望这就是运行缓慢的原因。此外，我目前正计划使用类似的模式来进行深度优先搜索，我担心它会变得太慢。

我找到了此介绍和博客，这些话题讨论了这些问题，但在专家层面上太过专业。 / p>

导入System.IO 导入Control.Monad 导入Data.Map.Strict作为地图将Data.List导入为L 类型NodeName = Int 类型Edges = [NodeName] 类型Explored = Bool data Node = Node Explored （Edges，Edges）派生（显示）类型Graph1 = Map NodeName节点 getLines :: FilePath - > IO [[Int]] getLines = liftM（fmap（fmap read。words）。lines）。 readFile getLines':: FilePath - > IO [（Int，Int）] getLines'= liftM（fmap（tuplify2。fmap read。words）。lines）。 readFile tuplify2 :: [a] - > （a，a） tuplify2 [x，y] =（x，y） main = do list< - getLinestestdata.txt - [String ] --list< - getLinesSCC.txt - [String] let list'= createGraph list 返回列表' createGraph :: [[Int]] - > Graph1 createGraph xs = L.foldl'build Map.empty xs 其中 build :: Graph1-> [Int] - > Graph1 build = \acc（x：y：_） - > let tmpAcc = case Map.lookup x acc Nothing - > Map.insert x（Node False（[y]，[]））acc 只是 - > Map.adjust（\（Node _（fwd，bck）） - >（Node False（（y：fwd），bck）））x acc （如果Map.lookup y tmpAcc为没有 - > Map.insert y（Node False（[]，[x]））tmpAcc 只是 - > Map.adjust（\（Node _（fwd，bck）） - >（Node False（fwd，（x：bck））））y tmpAcc

使用

解决方案
code> IntMap 或 HashMap 。 Int 键比 Map 快得多。 HashMap 通常比 IntMap 快，但使用更多内存并且库的内容较少。

不要做不必要的查找。 容器包有大量的专用函数。使用 alter 与问题中的 createGraph 实现相比，查找次数可以减半。

createGraph 示例：
import Data.List（foldl'）导入限定的Data.IntMap.Strict作为IM 类型NodeName = Int 类型Edges = [NodeName] 类型Explored = Bool data Node = Node Explored Edges导出（Eq，Show）的边缘类型Graph1 = IM.IntMap节点 createGraph :: [（Int，Int）] - > Graph1 createGraph xs = foldl'build IM.empty xs where addFwd y（Just（Node _ fb））= Just（Node False（y：f）b） addFwd y _ = Just（Node False [y] []） addBwd x（Just（Node _fb））= Just（Node False f（x：b）） addBwd x _ = Just节点False [] [x]） build :: Graph1 - > （Int，Int） - > Graph1 build acc（x，y）= IM.alter（addBwd x）y $ IM.alter（addFwd y）x acc

使用载体：

考虑高效的构造函数（累加器，展开式， generate ， iterate ， constructN 等）。这些可能会在幕后使用突变，但比实际的可变向量使用起来要方便得多。在更一般的情况下，使用盒装向量的懒惰来构建自定义向量。

尽可能使用非盒装向量。

当您完全确定界限时使用不安全的函数。

只有在没有纯选项时才使用可变向量。在那种情况下，更喜欢ST monad到IO。另外，请避免创建许多可变堆对象（即优先选择可变引物到不可变引用的不可变引物）。

createGraph 示例：
将限定的Data.Vector导入为V 类型NodeName = Int 类型Edges = [NodeName] Explored = Bool data Node = Node Explored Edges导出（Eq，Show）的边缘类型Graph1 = V.Vector节点 createGraph :: Int - > [（Int，Int）] - > Graph1 createGraph maxIndex edges = graph''其中 graph = V.replicate maxIndex（Node False [] []） graph'= V.accum（\（Node efb）x - >节点e（x：f）b）图形边缘图形'= V.accum（\（Node efb）x→Node ef（x：b））graph'（map（\\ （a，b） - >（b，a））边缘）
在节点索引范围内的差距，那么明智的做法是在做任何其他事情之前连续重新标记索引。

为 Node 引入一个空的构造函数来表示缺少的索引。

更快的I / O：

使用来自 Data.Text 或 Data.ByteString 的IO函数。在这两种情况下，还有用于将输入分为行或单词的有效功能。

示例：

import合格的Data.ByteString.Char8作为BS import System.IO getLines :: FilePath - > IO [（Int，Int）] getLines path = do lines < - （map BS.words。BS.lines）`fmap` BS.readFile path let pairs =（map （可能（错误无法读取Int）fst。BS.readInt）行 return [（a，b）| [a，b] < - 对]

标杆：

总是这样做，不像我在这个答案。使用 条件 。
I am trying to implement Kosaraju's graph algorithm, on a 3.5m line file where each row is two (space separated) Ints representing a graph edge. To start I need to create a summary data structure that has the node and lists of its incoming and outgoing edges. The code below achieves that, but takes over a minute, whereas I can see from posts on the MOOC forum that people using other languages are completing in <<10s. (getLines is taking 10s compared to under 1s in benchmarks I read about.)

I'm new to Haskell and have implemented an accumulation method using foldl' (the ' was a breakthrough in making it terminate at all), but it feels rather imperative in style, and I'm hoping that that's the reason why it is running slow. Moreover, I'm currently planning to use a similar pattern to conduct the depth-first-search, and I fear it will all just become too slow.

I have found this presentation and blog that talk about these sort of issues but at too expert a level.
import System.IO import Control.Monad import Data.Map.Strict as Map import Data.List as L type NodeName = Int type Edges = [NodeName] type Explored = Bool data Node = Node Explored (Edges, Edges) deriving (Show) type Graph1 = Map NodeName Node getLines :: FilePath -> IO [[Int]] getLines = liftM (fmap (fmap read . words) . lines) . readFile getLines' :: FilePath -> IO [(Int,Int)] getLines' = liftM (fmap (tuplify2 . fmap read . words) . lines) . readFile tuplify2 :: [a] -> (a,a) tuplify2 [x,y] = (x,y) main = do list <- getLines "testdata.txt" -- [String] --list <- getLines "SCC.txt" -- [String] let list' = createGraph list return list' createGraph :: [[Int]] -> Graph1 createGraph xs = L.foldl' build Map.empty xs where build :: Graph1-> [Int] -> Graph1 build = \acc (x:y:_) -> let tmpAcc = case Map.lookup x acc of Nothing -> Map.insert x (Node False ([y],[])) acc Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False ((y:fwd), bck))) x acc in case Map.lookup y tmpAcc of Nothing -> Map.insert y (Node False ([],[x])) tmpAcc Just a -> Map.adjust (\(Node _ (fwd, bck)) -> (Node False (fwd, (x:bck)))) y tmpAcc

解决方案
Using maps:

Use IntMap or HashMap when possible. Both are significantly faster for Int keys than Map. HashMap is usually faster than IntMap but uses more RAM and has a less rich library.

Don't do unnecessary lookups. The containers package has a large number of specialized functions. With alter the number of lookups can be halved compared to the createGraph implementation in the question.

Example for createGraph:
import Data.List (foldl') import qualified Data.IntMap.Strict as IM type NodeName = Int type Edges = [NodeName] type Explored = Bool data Node = Node Explored Edges Edges deriving (Eq, Show) type Graph1 = IM.IntMap Node createGraph :: [(Int, Int)] -> Graph1 createGraph xs = foldl' build IM.empty xs where addFwd y (Just (Node _ f b)) = Just (Node False (y:f) b) addFwd y _ = Just (Node False [y] []) addBwd x (Just (Node _ f b)) = Just (Node False f (x:b)) addBwd x _ = Just (Node False [] [x]) build :: Graph1 -> (Int, Int) -> Graph1 build acc (x, y) = IM.alter (addBwd x) y $ IM.alter (addFwd y) x acc

Using vectors:

Consider the efficient construction functions (the accumulators, unfolds, generate, iterate, constructN, etc.). These may use mutation behind the scenes but are considerably more convenient to use than actual mutable vectors.

In the more general case, use the laziness of boxed vectors to enable self-reference when constructing a vector.

Use unboxed vectors when possible.

Use unsafe functions when you're absolutely sure about the bounds.

Only use mutable vectors when there aren't pure alternatives. In that case, prefer the ST monad to IO. Also, avoid creating many mutable heap objects (i. e. prefer mutable vectors to immutable vectors of mutable references).

Example for createGraph:
import qualified Data.Vector as V type NodeName = Int type Edges = [NodeName] type Explored = Bool data Node = Node Explored Edges Edges deriving (Eq, Show) type Graph1 = V.Vector Node createGraph :: Int -> [(Int, Int)] -> Graph1 createGraph maxIndex edges = graph'' where graph = V.replicate maxIndex (Node False [] []) graph' = V.accum (\(Node e f b) x -> Node e (x:f) b) graph edges graph'' = V.accum (\(Node e f b) x -> Node e f (x:b)) graph' (map (\(a, b) -> (b, a)) edges)
Note that if there are gaps in the range of the node indices, then it'd be wise to either

Contiguously relabel the indices before doing anything else.

Introduce an empty constructor to Node to signify a missing index.

Faster I/O:

Use the IO functions from Data.Text or Data.ByteString. In both cases there are also efficient functions for breaking input into lines or words.

Example:
import qualified Data.ByteString.Char8 as BS import System.IO getLines :: FilePath -> IO [(Int, Int)] getLines path = do lines <- (map BS.words . BS.lines) `fmap` BS.readFile path let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines return [(a, b) | [a, b] <- pairs]

Benchmarking:

Always do it, unlike me in this answer. Use criterion.

这篇关于从文件中优化Haskell数据读取的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

从文件中优化Haskell数据读取 [英] Optimising Haskell data reading from file

问题描述

使用载体：

更快的I / O：

标杆：

Using maps:

Using vectors:

Faster I/O:

Benchmarking:

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

从文件中优化Haskell数据读取 [英] Optimising Haskell data reading from file

问题描述

使用载体：

更快的I / O：

标杆：

Using maps:

Using vectors:

Faster I/O:

Benchmarking:

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭