Haskell:扫描列表并为每个元素应用不同的函数 [英] Haskell: Scan Through a List and Apply A Different Function for Each Element

查看:65
本文介绍了Haskell:扫描列表并为每个元素应用不同的函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要扫描文档并为文件中的每个字符串累积不同函数的输出。在文件的任何给定行上运行的函数取决于该行中的内容。

我可以非常低效地完成这项工作,为每个我想收集的列表完整地传递文件。示例伪代码:

  at :: B.ByteString  - >也许原子
在行
|处行== ATOM记录=做东西返回只是原子
|否则= Nothing

ot :: B.ByteString - >也许Sheet

| line == SHEET record =做其他的东西来返回Just Sheet
|否则= Nothing

然后,我会将这些函数映射到文件中的整行列表以获得原子和表格的完整列表:

  mapper :: [B.ByteString]  - > IO()
mapper lines = do
let atoms = mapMaybe at lines
let sheets = mapMaybe to lines
- 用我的原子和图表做什么

然而,这是低效率的,因为我正在为我想创建的每个列表映射整个字符串列表。相反,我只想映射一次线串列表,在我移动它时识别每一行,然后应用适当的函数并将这些值存储在不同的列表中。



我的C心态想要这样做(伪代码):

  mapper':: [B.ByteString]  - > IO()
mapper'lines = do
let atoms = []
let sheets = []
for line in line:
| line == ATOM record =(atoms = atoms ++ at line)
| line == SHEET record =(sheets = sheets ++ ot line)
- 现在'atoms'是所有ATOM记录的完整列表
- 'sheets'是所有ATOM记录的完整列表SHEET记录

Haskell做这件事的方法是什么?我根本无法得到我的功能编程思想来提出解决方案。

解决方案

首先,我认为其他人提供的答案至少可以在95%的时间内工作。通过使用适当的数据类型(或在某些情况下的元组)来编写手头问题总是一个好习惯。但是,有时候你实际上并不知道你在列表中寻找什么,在这些情况下,试图列举所有可能性是困难的/耗时的/容易出错的。或者,您正在编写多种相同类型的变体(手动将多个折叠内联到一个),并且您想要捕获抽象。



幸运的是,是一些可以提供帮助的技术。

框架解决方案



(有点自我宣传)首先,各种iteratee / enumerator软件包通常提供处理这类问题的函数。

我最熟悉迭代,它可以让您执行以下操作:

  import Data.Iteratee as I 
import Data.Iteratee.Char
import Data.Maybe

- 首先,你需要一些方法来处理Atoms / Sheets / etc。你得到的是
- 如果你想把它们作为列表返回,你可以使用内置的
- stream2list函数

- 接下来,创建流变形金刚
- 在:: B.ByteString - >也许Atom
- 创建一个从ByteString行到原子
的流转换器atIter :: Enumeratee [B.ByteString] [Atom] ma
atIter = I.mapChunks(catMaybes。map at)

otIter :: Enumeratee [B.ByteString] [Sheet] ma
otIter = I.mapChunks(catMaybes。map ot)

- 最后,将多个处理器成一个
- 如果你有多个处理器,你可以使用zip3,zip4等
procFile :: Iteratee [B.ByteString] m([Atom],[Sheet])
procFile = I.zip(atIter = $ stream2list)(otIter = $ stream2list)

- 在一些数据上运行
runner :: FilePath - > IO([Atom],[Sheet])
runner filename = do
resultIter< - enumFile defaultBufSize filename $ = enumLinesBS $ procFile
run resultIter

这给您带来的一个好处是额外的可组合性。你可以随意创建变形金刚,并将它们与拉链结合起来。如果你愿意的话,你甚至可以并行地运行消费者(尽管只有当你在 IO monad中工作时,可能不值得,除非消费者做了很多工作)改变为:

  import Data.Iteratee.Parallel 

parProcFile = I.zip( parI $ atIter = $ stream2list)(parI $ otIter = $ stream2list)

这样做的结果isn与单个for循环相同 - 这仍然会执行多次遍历的数据。但是,遍历模式已经改变。这将一次加载一定数量的数据( defaultBufSize 字节)并多次遍历该块,根据需要存储部分结果。块完全耗尽后,下一块被加载,旧块可以被垃圾收集。



希望这可以证明不同点:

  Data.List.zip:
x1 x2 x3 .. x_n
x1 x2 x3 .. x_n

Data.Iteratee.zip:
x1 x2 x3 x4 x_n-1 x_n
x1 x2 x3 x4 x_n-1 x_n

如果您做的工作足够平行,那么这根本就不是问题。由于内存局部性,性能要比整个输入的多次遍历好得多,因为 Data.List.zip 会产生。



美丽的解决方案



如果单遍解决方案确实最有意义,那么您可能会对Max Rabkin的 followup work this )。基本的想法是,您可以创建数据结构来表示折叠和拉链,并且将这些组合起来可以创建仅需要一次遍历的新的组合折叠/拉链功能。对于一个Haskell初学者来说,这可能有点高级,但是因为你正在考虑这个问题,所以你会觉得它很有趣或者有用。 Max的帖子可能是最好的出发点。


I need to scan through a document and accumulate the output of different functions for each string in the file. The function run on any given line of the file depends on what is in that line.

I could do this very inefficiently by making a complete pass through the file for every list I wanted to collect. Example pseudo-code:

at :: B.ByteString -> Maybe Atom
at line
    | line == ATOM record = do stuff to return Just Atom
    | otherwise = Nothing

ot :: B.ByteString -> Maybe Sheet
ot line
    | line == SHEET record = do other stuff to return Just Sheet
    | otherwise = Nothing

Then, I would map each of these functions over the entire list of lines in the file to get a complete list of Atoms and Sheets:

mapper :: [B.ByteString] -> IO ()
mapper lines = do
    let atoms = mapMaybe at lines
    let sheets = mapMaybe to lines
    -- Do stuff with my atoms and sheets

However, this is inefficient because I am maping through the entire list of strings for every list I am trying to create. Instead, I want to map through the list of line strings only once, identify each line as I am moving through it, and then apply the appropriate function and store these values in different lists.

My C mentality wants to do this (pseudo code):

mapper' :: [B.ByteString] -> IO ()
mapper' lines = do
    let atoms = []
    let sheets = []
    for line in lines:
        | line == ATOM record = (atoms = atoms ++ at line)
        | line == SHEET record = (sheets = sheets ++ ot line)
    -- Now 'atoms' is a complete list of all the ATOM records
    --  and 'sheets' is a complete list of all the SHEET records

What is the Haskell way of doing this? I simply can't get my functional-programming mindset to come up with a solution.

解决方案

First of all, I think that the answers others have supplied will work at least 95% of the time. It's always good practice to code for the problem at hand by using appropriate data types (or tuples in some cases). However, sometimes you really don't know in advance what you're looking for in the list, and in these cases trying to enumerate all possibilities is difficult/time-consuming/error-prone. Or, you're writing multiple variants of the same sort of thing (manually inlining multiple folds into one) and you'd like to capture the abstraction.

Fortunately, there are a few techniques that can help.

The framework solution

(somewhat self-evangelizing)

First, the various "iteratee/enumerator" packages often provide functions to deal with this sort of problem. I'm most familiar with iteratee, which would let you do the following:

import Data.Iteratee as I
import Data.Iteratee.Char
import Data.Maybe

-- first, you'll need some way to process the Atoms/Sheets/etc. you're getting
-- if you want to just return them as a list, you can use the built-in
-- stream2list function

-- next, create stream transformers
-- given at :: B.ByteString -> Maybe Atom
-- create a stream transformer from ByteString lines to Atoms
atIter :: Enumeratee [B.ByteString] [Atom] m a
atIter = I.mapChunks (catMaybes . map at)

otIter :: Enumeratee [B.ByteString] [Sheet] m a
otIter = I.mapChunks (catMaybes . map ot)

-- finally, combine multiple processors into one
-- if you have more than one processor, you can use zip3, zip4, etc.
procFile :: Iteratee [B.ByteString] m ([Atom],[Sheet])
procFile = I.zip (atIter =$ stream2list) (otIter =$ stream2list)

-- and run it on some data
runner :: FilePath -> IO ([Atom],[Sheet])
runner filename = do
  resultIter <- enumFile defaultBufSize filename $= enumLinesBS $ procFile
  run resultIter

One benefit this gives you is extra composability. You can create transformers as you like, and just combine them with zip. You can even run the consumers in parallel if you like (although only if you're working in the IO monad, and probably not worth it unless the consumers do a lot of work) by changing to this:

import Data.Iteratee.Parallel

parProcFile = I.zip (parI $ atIter =$ stream2list) (parI $ otIter =$ stream2list)

The result of doing so isn't the same as a single for-loop - this will still perform multiple traversals of the data. However, the traversal pattern has changed. This will load a certain amount of data at once (defaultBufSize bytes) and traverse that chunk multiple times, storing partial results as necessary. After a chunk has been entirely consumed, the next chunk is loaded and the old one can be garbage collected.

Hopefully this will demonstrate the difference:

Data.List.zip:
  x1 x2 x3 .. x_n
                   x1 x2 x3 .. x_n

Data.Iteratee.zip:
  x1 x2      x3 x4      x_n-1 x_n
       x1 x2      x3 x4           x_n-1 x_n

If you're doing enough work that parallelism makes sense this isn't a problem at all. Due to memory locality, the performance is much better than multiple traversals over the entire input as Data.List.zip would make.

The beautiful solution

If a single-traversal solution really does make the most sense, you might be interested in Max Rabkin's Beautiful Folding post, and Conal Elliott's followup work (this too). The essential idea is that you can create data structures to represent folds and zips, and combining these lets you create a new, combined fold/zip function that only needs one traversal. It's maybe a little advanced for a Haskell beginner, but since you're thinking about the problem you may find it interesting or useful. Max's post is probably the best starting point.

这篇关于Haskell:扫描列表并为每个元素应用不同的函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆