在Haskell中实现分裂的一行代码 [英] One-line implementation of split in Haskell

查看:162
本文介绍了在Haskell中实现分裂的一行代码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想要的是下面这个(我认为它应该包含在前奏中,因为它在文本处理中非常有用):

  split :: Eq a => [a]  - > [a]  - > [b] 
$ / code>

例如:

  split;; hello ;; world= [hello,world] 

split 来自 Data.List.Utils 不在基础中。我觉得应该通过编写一些基本函数来实现一个简短而甜蜜的实现,但我无法弄清楚。我错过了什么?

解决方案

可以这么说,最好的方法是检查短而甜的 splitOn (或 split ,就像你和 MissingH 称呼它 - 在这里我将坚持使用< a href =https://hackage.haskell.org/package/split-0.2.3.3/docs/Data-List-Split.html#v:splitOn =nofollow noreferrer> split 附加 包)正在尝试写它[注1]。
$ b

(顺便说一句,我将使用 recursion-schemes 这个答案中的函数和概念,就像我发现系统化的东西有点帮助我思考这种问题如果有什么不清楚的话,请告诉我。)



splitOn 的类型是[注2]:

  splitOn :: Eq a => [a]  - > [a]  - > [a]] 

一种编写递归函数的方法,可以从另一个数据结构 splitOn 确实是这样做的,首先询问是否通过在自下而上自上而下 (对于列表,分别为从右到左和从左到右)。自下而上的散步更自然地表现为某种折叠:

  foldr @ [] ::(a  - > b→b)→> b  - > [a]  - > b 
cata @ [_] ::(ListF a b - > b) - > [a] - > b

cata recurstion-schemes 表达了一个vanilla fold。 ListF ab - > b 函数,称为代数该术语指定了每个折叠步骤中会发生什么 data ListF ab = Nil | Cons ab ,所以在列表的情况下,代数等于两个第一个参数 foldr 合并成一个 - 二元函数对应于 Cons 大小写,并且折叠的种子为

另一方面,自上而下的散步有助于展开:

  unfoldr ::(b  - > Maybe(a,b)) - > b  - > [a]  - 在Data.List中找到
ana @ [_] ::(b - > ListF a b) - > b - > [a]

ana , anamorphism,是在 recursion-schemes 中展开的香草。 b - > ListF ab 函数是一个 coalgebra 。它指定了每个展开步骤会发生什么,对于一个列表,可能性是发出一个列表元素和一个更新的种子,或者生成一个空列表并终止展开。)



splitOn 是自下而上还是自上而下?为了实现它,我们需要在列表中的任何位置处向前看,以检查当前列表段是否以分隔符开始。这就是说,达成自顶向下的解决方案是合理的,即展开/变形。



使用写入 splitOn code>展开展示了另一个需要考虑的事情:您希望每个人展开步骤以生成完整列表块。如果不这样做,充其量会导致您不必要地将原始列表两次走过[注3];最糟糕的是,长列表块上的灾难性内存使用和堆栈溢出正在等待[注4]。一种方法是通过 breakOn 函数来实现,就像 Data.List.Extra ...

  breakOn :: Eq a => [a]  - > [a]  - > ([a],[a])

...就像在Prelude中打破,不同之处在于,不是将谓词应用于每个元素,而是检查剩余列表段是否具有第一个参数作为前缀[注释5]。



使用 breakOn ,我们可以编写一个恰当的 splitOn 实现 - 一个,以优化方式进行编译,匹配开始时提到的库的性能:

  splitOnAtomic :: Eq a => [a]  - > [a]  - > [[a]] 
splitOnAtomic delim
| null delim =错误splitOnAtomic:空分隔符
|否则= apo coalgSplit
其中
delimLen =长度限定
coalgSplit = \ case
[] - >缺点[](左[])
li - >
使用(ch,xs)= breakOn(delim`isPrefixOf`)li
在Cons ch(Right(drop delimLen xs))

apo ,简称为apomorphism,是一种可以短路的展开式。展开步骤,而不是通常更新的种子 - 通过右标记 - 最终结果 - 由左标记这里需要短路,因为在空列表的情况下,我们不希望通过返回 Nil 产生一个空列表,这会错误地导致 splitOn delim [] = [] - 也不会求助于 Cons [] [] - 这会产生无限的 [] 。这个技巧直接对应于添加到 splitOn _ [] = [[]] href =http://hackage.haskell.org/package/extra-1.6.4/docs/src/Data-List-Extra.html#splitOn =nofollow noreferrer> Data。 List.Extra 执行。)



经过一些小小的弯路,我们现在可以解决您的实际问题。 splitOn 在很短的时间内编写起来很棘手,因为首先,它使用的递归模式并不是微不足道的;其次,良好的实施需要一些不方便打高尔夫球的细节;第三,看起来最好的实现依赖于 breakOn ,它不在 base 中。






注意:

[注1]:在这个答案中运行片段所需的杂注:

  { - #LANGUAGE LambdaCase# - } 
{ - #语言DeriveFunctor# - }
{ - #语言DerivedFoldable# - }
{ - #LANGUAGE TemplateHaskell # - }

导入Data.Functor.Foldable
导入Data.Functor.Foldable.TH
导入Data.List
导入Data.Maybe

[note 2]:另一种类型可能是 Eq a = > NonEmpty a - > [a] - > NonEmpty [a] ,如果你想把精度放在一切之上。在这里我不会打扰,以避免不必要的干扰。



[注3]:在这个相当整洁的实现中,展开 - 第一个( ana coalgMark )用 Nothing替换分隔符,这样第二个(<$

  splitOnMark :: Eq a(c $ c> apo coalgSplit )可以以直接的方式拆分:

=> [a] - > [a] - > [[a]]
splitOnMark分界
| null delim =错误splitOnMark:空分隔符
|否则= apo coalgSplit。 ana coalgMark
where
coalgMark = \ case
[] - >无
li @(x:xs) - >案例strip
只是ys - >缺点无ys
无 - >缺点(只要x)xs
coalgSplit = \ case
[] - >缺点[](左[])
mxs - >
让(mch,mys)= break在cons中没有mxs
(catMaybes mch)(Right(drop 1 mys))


$ b $ <什么 apo 是什么 Left



这个实现具有相当可接受的性能,不过,通过优化,它比答案的主体中的一个慢一个(适度)常数因子慢。虽然... ...高尔夫球可能会更容易一些,但是...



[注4]:在这个单一展开实施中,使用一个代数递归地将每个块构建为(差异)列表:

  splitOnNaive :: Eq a => [a]  - > [a]  - > [[a]] 
splitOnNaive分隔符
| null delim =错误splitOn:空分隔符
|否则= apo coalgSplit。 (,)id
其中
coalgSplit = \ case
(ch,[]) - > Cons(ch [])(Left [])
(ch,li @(x:xs)) - >案例strip
只是ys - > Cons(ch [])(Right(id,ys))
Nothing - > coalg(ch。(x :),xs)

必须决定每个元素是否增长当前大块或开始一个新块本身是有问题的,因为它打破了懒惰。

[注5]: 这是数据的方法。 List.Extra 实现 breakOn 。如果我们想用recursion-schemes 展开实现,一个好的策略是定义一个数据结构,它准确地编码我们正在构建的内容:

 数据BrokenList a = Broken [a] | Unbroken a(BrokenList a)
derived(Eq,Show,Functor,Foldable,Traversable)
makeBaseFunctor''BrokenList



BrokenList 就像一个列表,只是空列表被替换为(非递归) Broken 构造函数,它标记断点并保存列表的其余部分。一旦展开生成, BrokenList 可以轻松地折叠成一对列表: Unbroken 值中的元素被列入一个列表,并且 Broken 中的列表成为另一个:

  breakOn ::([a]  - > Bool) - > [a]  - > ([a],[a])
breakOn p = hylo algPair coalgBreak
where
coalgBreak = \ case
[] - > BrokenF []
li @(x:xs)
| p li - > BrokenF li
|否则 - > UnbrokenF x xs
algPair = \ case
UnbrokenF x〜(xs,ys) - > (x:xs,ys)
BrokenF ys - > ([],ys)

hylo ,简写为 ana ,后面跟着 cata ,即展开后跟一个折叠。<$在 recursion-schemes 中实现的c $ c> hylo 利用了这样一个事实,即中间数据结构由展开并由折叠立即消耗而创建, )



值得一提的是, algPair 对于保持懒惰至关重要。与上面链接的 Data.List.Extra 实现通过从 Control使用 first 实现。箭头,这也与懒惰地给它的一对匹配。


What I want is the following one (which I think should be included in prelude since it is very useful in text processing):

split :: Eq a => [a] -> [a] -> [[a]]

e.g:

split ";;" "hello;;world" = ["hello", "world"]

split from Data.List.Utils isn't in base. I feel there should be a short-and-sweet implementation by composing a few base functions, but I can't figure it out. Am I missing something?

解决方案

Arguably, the best way to check how feasible a short-and-sweet splitOn (or split, as you and MissingH call it -- here I will stick to the name used by the split and extra packages) is trying to write it [note 1].

(By the way, I will use recursion-schemes functions and concepts in this answer, as I find systematising things a bit helps me think about this kind of problem. Do let me know if anything is unclear.)

The type of splitOn is [note 2]:

splitOn :: Eq a => [a] -> [a] -> [[a]]

One way to write a recursive function that builds one data structure from another, like splitOn does, begins by asking whether to do it by walking the original structure in a bottom-up or a top-down way (for lists, that amounts to right-to-left and left-to-right respectively). A bottom-up walk is more naturally expressed as some kind of fold:

foldr @[] :: (a -> b -> b) -> b -> [a] -> b
cata @[_] :: (ListF a b -> b) -> [a] -> b

(cata, short for catamorphism, is how recurstion-schemes expresses a vanilla fold. The ListF a b -> b function, called an algebra in the jargon, specifies what happens in each fold step. data ListF a b = Nil | Cons a b, and so, in the case of lists, the algebra amounts to the two first arguments of foldr rolled into one -- the binary function corresponds to the Cons case, and the seed of the fold, to the Nil one.)

A top-down walk, on the other hand, lends itself to an unfold:

unfoldr :: (b -> Maybe (a, b)) -> b -> [a] -- found in Data.List
ana @[_] :: (b -> ListF a b) -> b -> [a]

(ana, short for anamorphism, is the vanilla unfold in recursion-schemes. The b -> ListF a b function is a coalgebra; it specifies what happens in each unfold step. For a list, the possibilities are either emitting a list element and an updated seed or generating an empty list and terminating the unfold.)

Should splitOn be bottom-up or top-down? To implement it, we need to, at any given position in the list, look ahead in order to check whether the current list segment starts with the delimiter. That being so, it makes sense to reach for a top-down solution i.e. an unfold/anamorphism.

Playing with ways to write splitOn as an unfold shows another thing to consider: you want each individual unfold step to generate a fully-formed list chunk. Not doing so will, at best, lead you to unnecessarily walk the original list twice [note 3]; at worst, catastrophic memory usage and stack overflows on long list chunks await [note 4]. One way to achieve that is through a breakOn function, like the one in Data.List.Extra...

breakOn :: Eq a => [a] -> [a] -> ([a], [a]) 

... which is like break from the Prelude, except that, instead of applying a predicate to each element, it checks whether the remaining list segment has the first argument as a prefix [note 5].

With breakOn at hand, we can write a proper splitOn implementation -- one that, compiled with optimisations, matches in performance the library ones mentioned at the beginning:

splitOnAtomic :: Eq a => [a] -> [a] -> [[a]]
splitOnAtomic delim
    | null delim = error "splitOnAtomic: empty delimiter"
    | otherwise = apo coalgSplit
    where
    delimLen = length delim
    coalgSplit = \case
        [] -> Cons [] (Left [])
        li ->
            let (ch, xs) = breakOn (delim `isPrefixOf`) li
            in Cons ch (Right (drop delimLen xs))

(apo, short for apomorphism, is an unfold that can be short-circuited. That is done by emitting from an unfold step, rather than the usual updated seed -- signaled by Right -- a final result -- signaled by Left. Short-circuiting is needed here because, in the empty list case, we want neither to produce an empty list by returning Nil -- which would wrongly result in splitOn delim [] = [] -- nor to resort to Cons [] [] -- which would generate an infinite tail of []. This trick corresponds directly to the additional splitOn _ [] = [[]] case added to the Data.List.Extra implementation.)

After a few slight detours, we can now address your actual question. splitOn is tricky to write in a short way because, firstly, the recursion pattern it uses isn't entirely trivial; secondly, a good implementation requires a few details that are inconvenient for golfing; and thirdly, what appears to be the best implementation relies crucially on breakOn, which is not in base.


Notes:

[note 1]: Here are the imports and pragmas needed to run the snippets in this answer:

{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE DeriveFoldable #-}
{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TemplateHaskell #-}

import Data.Functor.Foldable
import Data.Functor.Foldable.TH
import Data.List
import Data.Maybe

[note 2]: An alternative type might be Eq a => NonEmpty a -> [a] -> NonEmpty [a], if one wants to put precision above all else. I won't bother with that here to avoid unnecessary distractions.

[note 3]: As in this rather neat implementation, which uses two unfolds -- the first one (ana coalgMark) replaces the delimiters with Nothing, so that the second one (apo coalgSplit) can split in a straightforward way:

splitOnMark :: Eq a => [a] -> [a] -> [[a]]
splitOnMark delim
    | null delim = error "splitOnMark: empty delimiter"
    | otherwise = apo coalgSplit . ana coalgMark
    where
    coalgMark = \case
        [] -> Nil
        li@(x:xs) -> case stripPrefix delim li of
            Just ys -> Cons Nothing ys
            Nothing -> Cons (Just x) xs
    coalgSplit = \case
        [] -> Cons [] (Left [])
        mxs ->
            let (mch, mys) = break isNothing mxs
            in Cons (catMaybes mch) (Right (drop 1 mys))

(What apo is and what Left and Right are doing here will be covered a little further in the main body of the answer.)

This implementation has fairly acceptable performance, though with optimisations it is a slower than the one in the main body of the answer by a (modest) constant factor. It might be a little easier to golf this one, though...

[note 4]: As in this single unfold implementation, which uses a coalgebra that calls itself recursively to build each chunk as a (difference) list:

splitOnNaive :: Eq a => [a] -> [a] -> [[a]]
splitOnNaive delim 
    | null delim = error "splitOn: empty delimiter"
    | otherwise = apo coalgSplit . (,) id
    where
    coalgSplit = \case
        (ch, []) -> Cons (ch []) (Left [])
        (ch, li@(x:xs)) -> case stripPrefix delim li of
            Just ys -> Cons (ch []) (Right (id, ys))
            Nothing -> coalg (ch . (x :), xs)

Having to decide at each element whether to grow the current chunk or to start a new one is in itself problematic, as it breaks laziness.

[note 5]: Here is how Data.List.Extra implements breakOn. If we want to achieve that using a recursion-schemes unfold, one good strategy is defining a data structure that encodes exactly what we are trying to build:

data BrokenList a = Broken [a] | Unbroken a (BrokenList a)
    deriving (Eq, Show, Functor, Foldable, Traversable)
makeBaseFunctor ''BrokenList

A BrokenList is just like a list, except that the empty list is replaced by the (non-recursive) Broken constructor, which marks the break point and holds the remainder of the list. Once generated by an unfold, a BrokenList can be easily folded into a pair of lists: the elements in the Unbroken values are consed into one list, and the list in Broken becomes the other one:

breakOn :: ([a] -> Bool) -> [a] -> ([a], [a])
breakOn p = hylo algPair coalgBreak
    where
    coalgBreak = \case
        [] -> BrokenF []
        li@(x:xs)
            | p li -> BrokenF li
            | otherwise -> UnbrokenF x xs
    algPair = \case 
        UnbrokenF x ~(xs, ys) -> (x : xs, ys)
        BrokenF ys -> ([], ys)

(hylo, short for hylomorphism, is simply an ana followed by a cata, i.e. an unfold followed by a fold. hylo, as implemented in recursion-schemes, takes advantage of the fact that the intermediate data structure, created by the unfold and then immediately consumed by the fold, can be fused away, leading to significant performance gains.)

It is worth mentioning that the lazy pattern match in algPair is crucial to preserve laziness. The Data.List.Extra implementation linked to above achieves that by using first from Control.Arrow, which also matches the pair given to it lazily.

这篇关于在Haskell中实现分裂的一行代码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆