在Haskell中实现分裂的一行代码 [英] One-line implementation of split in Haskell
问题描述
我想要的是下面这个(我认为它应该包含在前奏中,因为它在文本处理中非常有用):
split :: Eq a => [a] - > [a] - > [b]
$ / code>
例如:
split;; hello ;; world= [hello,world]
split
来自 Data.List.Utils
不在基础中。我觉得应该通过编写一些基本函数来实现一个简短而甜蜜的实现,但我无法弄清楚。我错过了什么?
可以这么说,最好的方法是检查短而甜的 splitOn
(或 split
,就像你和 MissingH 称呼它 - 在这里我将坚持使用< a href =https://hackage.haskell.org/package/split-0.2.3.3/docs/Data-List-Split.html#v:splitOn =nofollow noreferrer> split 和 附加 包)正在尝试写它[注1]。
$ b
(顺便说一句,我将使用 recursion-schemes 这个答案中的函数和概念,就像我发现系统化的东西有点帮助我思考这种问题如果有什么不清楚的话,请告诉我。)
splitOn :: Eq a => [a] - > [a] - > [a]]
一种编写递归函数的方法,可以从另一个数据结构 splitOn
确实是这样做的,首先询问是否通过在自下而上或自上而下 (对于列表,分别为从右到左和从左到右)。自下而上的散步更自然地表现为某种折叠:
foldr @ [] ::(a - > b→b)→> b - > [a] - > b
cata @ [_] ::(ListF a b - > b) - > [a] - > b
( cata
是 recurstion-schemes 表达了一个vanilla fold。 ListF ab - > b
函数,称为代数该术语指定了每个折叠步骤中会发生什么 data ListF ab = Nil | Cons ab
,所以在列表的情况下,代数等于两个第一个参数 foldr
合并成一个 - 二元函数对应于 Cons
大小写,并且折叠的种子为
另一方面,自上而下的散步有助于展开:
unfoldr ::(b - > Maybe(a,b)) - > b - > [a] - 在Data.List中找到
ana @ [_] ::(b - > ListF a b) - > b - > [a]
( ana
, anamorphism,是在 recursion-schemes 中展开的香草。 b - > ListF ab
函数是一个 coalgebra 。它指定了每个展开步骤会发生什么,对于一个列表,可能性是发出一个列表元素和一个更新的种子,或者生成一个空列表并终止展开。)
splitOn
是自下而上还是自上而下?为了实现它,我们需要在列表中的任何位置处向前看,以检查当前列表段是否以分隔符开始。这就是说,达成自顶向下的解决方案是合理的,即展开/变形。
使用写入 splitOn code>展开展示了另一个需要考虑的事情:您希望每个人展开步骤以生成完整列表块。如果不这样做,充其量会导致您不必要地将原始列表两次走过[注3];最糟糕的是,长列表块上的灾难性内存使用和堆栈溢出正在等待[注4]。一种方法是通过
breakOn
函数来实现,就像 Data.List.Extra
...
breakOn :: Eq a => [a] - > [a] - > ([a],[a])
...就像在Prelude中打破
,不同之处在于,不是将谓词应用于每个元素,而是检查剩余列表段是否具有第一个参数作为前缀[注释5]。
使用 breakOn
,我们可以编写一个恰当的 splitOn
实现 - 一个,以优化方式进行编译,匹配开始时提到的库的性能:
splitOnAtomic :: Eq a => [a] - > [a] - > [[a]]
splitOnAtomic delim
| null delim =错误splitOnAtomic:空分隔符
|否则= apo coalgSplit
其中
delimLen =长度限定
coalgSplit = \ case
[] - >缺点[](左[])
li - >
使用(ch,xs)= breakOn(delim`isPrefixOf`)li
在Cons ch(Right(drop delimLen xs))
( apo
,简称为apomorphism,是一种可以短路的展开式。展开步骤,而不是通常更新的种子 - 通过右标记 - 最终结果 - 由
左标记这里需要短路,因为在空列表的情况下,我们不希望通过返回
href =http://hackage.haskell.org/package/extra-1.6.4/docs/src/Data-List-Extra.html#splitOn =nofollow noreferrer> Nil
产生一个空列表,这会错误地导致 splitOn delim [] = []
- 也不会求助于 Cons [] []
- 这会产生无限的 []
。这个技巧直接对应于添加到 splitOn _ [] = [[]] Data。 List.Extra
执行。)
经过一些小小的弯路,我们现在可以解决您的实际问题。
splitOn
在很短的时间内编写起来很棘手,因为首先,它使用的递归模式并不是微不足道的;其次,良好的实施需要一些不方便打高尔夫球的细节;第三,看起来最好的实现依赖于 breakOn
,它不在 base 中。 注意:
[注1]:在这个答案中运行片段所需的杂注:
{ - #LANGUAGE LambdaCase# - }
{ - #语言DeriveFunctor# - }
{ - #语言DerivedFoldable# - }
{ - #LANGUAGE TemplateHaskell # - }
导入Data.Functor.Foldable
导入Data.Functor.Foldable.TH
导入Data.List
导入Data.Maybe
[note 2]:另一种类型可能是 Eq a = > NonEmpty a - > [a] - > NonEmpty [a]
,如果你想把精度放在一切之上。在这里我不会打扰,以避免不必要的干扰。
[注3]:在这个相当整洁的实现中,展开 - 第一个( ana coalgMark
)用 Nothing替换分隔符
,这样第二个(<$
splitOnMark :: Eq a(c $ c> apo coalgSplit
)可以以直接的方式拆分: => [a] - > [a] - > [[a]]
splitOnMark分界
| null delim =错误splitOnMark:空分隔符
|否则= apo coalgSplit。 ana coalgMark
where
coalgMark = \ case
[] - >无
li @(x:xs) - >案例strip
只是ys - >缺点无ys
无 - >缺点(只要x)xs
coalgSplit = \ case
[] - >缺点[](左[])
mxs - >
让(mch,mys)= break在cons中没有mxs
(catMaybes mch)(Right(drop 1 mys))
$ b $ <什么 apo
是什么 Left
和
这个实现具有相当可接受的性能,不过,通过优化,它比答案的主体中的一个慢一个(适度)常数因子慢。虽然... ...高尔夫球可能会更容易一些,但是...
[注4]:在这个单一展开实施中,使用一个代数递归地将每个块构建为(差异)列表:
splitOnNaive :: Eq a => [a] - > [a] - > [[a]]
splitOnNaive分隔符
| null delim =错误splitOn:空分隔符
|否则= apo coalgSplit。 (,)id
其中
coalgSplit = \ case
(ch,[]) - > Cons(ch [])(Left [])
(ch,li @(x:xs)) - >案例strip
只是ys - > Cons(ch [])(Right(id,ys))
Nothing - > coalg(ch。(x :),xs)
必须决定每个元素是否增长当前大块或开始一个新块本身是有问题的,因为它打破了懒惰。
[注5]: 这是数据的方法。 List.Extra
实现 breakOn
。如果我们想用recursion-schemes 展开实现,一个好的策略是定义一个数据结构,它准确地编码我们正在构建的内容:
数据BrokenList a = Broken [a] | Unbroken a(BrokenList a)
derived(Eq,Show,Functor,Foldable,Traversable)
makeBaseFunctor''BrokenList
BrokenList
就像一个列表,只是空列表被替换为(非递归) Broken
构造函数,它标记断点并保存列表的其余部分。一旦展开生成, BrokenList
可以轻松地折叠成一对列表: Unbroken
值中的元素被列入一个列表,并且 Broken
中的列表成为另一个:
breakOn ::([a] - > Bool) - > [a] - > ([a],[a])
breakOn p = hylo algPair coalgBreak
where
coalgBreak = \ case
[] - > BrokenF []
li @(x:xs)
| p li - > BrokenF li
|否则 - > UnbrokenF x xs
algPair = \ case
UnbrokenF x〜(xs,ys) - > (x:xs,ys)
BrokenF ys - > ([],ys)
( hylo
,简写为 ana
,后面跟着 cata
,即展开后跟一个折叠。<$在 recursion-schemes 中实现的c $ c> hylo 利用了这样一个事实,即中间数据结构由展开并由折叠立即消耗而创建, )
值得一提的是, algPair
对于保持懒惰至关重要。与上面链接的 Data.List.Extra
实现通过从 Control使用
,这也与懒惰地给它的一对匹配。 first
实现。箭头
What I want is the following one (which I think should be included in prelude since it is very useful in text processing):
split :: Eq a => [a] -> [a] -> [[a]]
e.g:
split ";;" "hello;;world" = ["hello", "world"]
split
from Data.List.Utils
isn't in base. I feel there should be a short-and-sweet implementation by composing a few base functions, but I can't figure it out. Am I missing something?
Arguably, the best way to check how feasible a short-and-sweet splitOn
(or split
, as you and MissingH call it -- here I will stick to the name used by the split and extra packages) is trying to write it [note 1].
(By the way, I will use recursion-schemes functions and concepts in this answer, as I find systematising things a bit helps me think about this kind of problem. Do let me know if anything is unclear.)
The type of splitOn
is [note 2]:
splitOn :: Eq a => [a] -> [a] -> [[a]]
One way to write a recursive function that builds one data structure from another, like splitOn
does, begins by asking whether to do it by walking the original structure in a bottom-up or a top-down way (for lists, that amounts to right-to-left and left-to-right respectively). A bottom-up walk is more naturally expressed as some kind of fold:
foldr @[] :: (a -> b -> b) -> b -> [a] -> b
cata @[_] :: (ListF a b -> b) -> [a] -> b
(cata
, short for catamorphism, is how recurstion-schemes expresses a vanilla fold. The ListF a b -> b
function, called an algebra in the jargon, specifies what happens in each fold step. data ListF a b = Nil | Cons a b
, and so, in the case of lists, the algebra amounts to the two first arguments of foldr
rolled into one -- the binary function corresponds to the Cons
case, and the seed of the fold, to the Nil
one.)
A top-down walk, on the other hand, lends itself to an unfold:
unfoldr :: (b -> Maybe (a, b)) -> b -> [a] -- found in Data.List
ana @[_] :: (b -> ListF a b) -> b -> [a]
(ana
, short for anamorphism, is the vanilla unfold in recursion-schemes. The b -> ListF a b
function is a coalgebra; it specifies what happens in each unfold step. For a list, the possibilities are either emitting a list element and an updated seed or generating an empty list and terminating the unfold.)
Should splitOn
be bottom-up or top-down? To implement it, we need to, at any given position in the list, look ahead in order to check whether the current list segment starts with the delimiter. That being so, it makes sense to reach for a top-down solution i.e. an unfold/anamorphism.
Playing with ways to write splitOn
as an unfold shows another thing to consider: you want each individual unfold step to generate a fully-formed list chunk. Not doing so will, at best, lead you to unnecessarily walk the original list twice [note 3]; at worst, catastrophic memory usage and stack overflows on long list chunks await [note 4]. One way to achieve that is through a breakOn
function, like the one in Data.List.Extra
...
breakOn :: Eq a => [a] -> [a] -> ([a], [a])
... which is like break
from the Prelude, except that, instead of applying a predicate to each element, it checks whether the remaining list segment has the first argument as a prefix [note 5].
With breakOn
at hand, we can write a proper splitOn
implementation -- one that, compiled with optimisations, matches in performance the library ones mentioned at the beginning:
splitOnAtomic :: Eq a => [a] -> [a] -> [[a]]
splitOnAtomic delim
| null delim = error "splitOnAtomic: empty delimiter"
| otherwise = apo coalgSplit
where
delimLen = length delim
coalgSplit = \case
[] -> Cons [] (Left [])
li ->
let (ch, xs) = breakOn (delim `isPrefixOf`) li
in Cons ch (Right (drop delimLen xs))
(apo
, short for apomorphism, is an unfold that can be short-circuited. That is done by emitting from an unfold step, rather than the usual updated seed -- signaled by Right
-- a final result -- signaled by Left
. Short-circuiting is needed here because, in the empty list case, we want neither to produce an empty list by returning Nil
-- which would wrongly result in splitOn delim [] = []
-- nor to resort to Cons [] []
-- which would generate an infinite tail of []
. This trick corresponds directly to the additional splitOn _ [] = [[]]
case added to the Data.List.Extra
implementation.)
After a few slight detours, we can now address your actual question. splitOn
is tricky to write in a short way because, firstly, the recursion pattern it uses isn't entirely trivial; secondly, a good implementation requires a few details that are inconvenient for golfing; and thirdly, what appears to be the best implementation relies crucially on breakOn
, which is not in base.
Notes:
[note 1]: Here are the imports and pragmas needed to run the snippets in this answer:
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE DeriveFoldable #-}
{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TemplateHaskell #-}
import Data.Functor.Foldable
import Data.Functor.Foldable.TH
import Data.List
import Data.Maybe
[note 2]: An alternative type might be Eq a => NonEmpty a -> [a] -> NonEmpty [a]
, if one wants to put precision above all else. I won't bother with that here to avoid unnecessary distractions.
[note 3]: As in this rather neat implementation, which uses two unfolds -- the first one (ana coalgMark
) replaces the delimiters with Nothing
, so that the second one (apo coalgSplit
) can split in a straightforward way:
splitOnMark :: Eq a => [a] -> [a] -> [[a]]
splitOnMark delim
| null delim = error "splitOnMark: empty delimiter"
| otherwise = apo coalgSplit . ana coalgMark
where
coalgMark = \case
[] -> Nil
li@(x:xs) -> case stripPrefix delim li of
Just ys -> Cons Nothing ys
Nothing -> Cons (Just x) xs
coalgSplit = \case
[] -> Cons [] (Left [])
mxs ->
let (mch, mys) = break isNothing mxs
in Cons (catMaybes mch) (Right (drop 1 mys))
(What apo
is and what Left
and Right
are doing here will be covered a little further in the main body of the answer.)
This implementation has fairly acceptable performance, though with optimisations it is a slower than the one in the main body of the answer by a (modest) constant factor. It might be a little easier to golf this one, though...
[note 4]: As in this single unfold implementation, which uses a coalgebra that calls itself recursively to build each chunk as a (difference) list:
splitOnNaive :: Eq a => [a] -> [a] -> [[a]]
splitOnNaive delim
| null delim = error "splitOn: empty delimiter"
| otherwise = apo coalgSplit . (,) id
where
coalgSplit = \case
(ch, []) -> Cons (ch []) (Left [])
(ch, li@(x:xs)) -> case stripPrefix delim li of
Just ys -> Cons (ch []) (Right (id, ys))
Nothing -> coalg (ch . (x :), xs)
Having to decide at each element whether to grow the current chunk or to start a new one is in itself problematic, as it breaks laziness.
[note 5]: Here is how Data.List.Extra
implements breakOn
. If we want to achieve that using a recursion-schemes unfold, one good strategy is defining a data structure that encodes exactly what we are trying to build:
data BrokenList a = Broken [a] | Unbroken a (BrokenList a)
deriving (Eq, Show, Functor, Foldable, Traversable)
makeBaseFunctor ''BrokenList
A BrokenList
is just like a list, except that the empty list is replaced by the (non-recursive) Broken
constructor, which marks the break point and holds the remainder of the list. Once generated by an unfold, a BrokenList
can be easily folded into a pair of lists: the elements in the Unbroken
values are consed into one list, and the list in Broken
becomes the other one:
breakOn :: ([a] -> Bool) -> [a] -> ([a], [a])
breakOn p = hylo algPair coalgBreak
where
coalgBreak = \case
[] -> BrokenF []
li@(x:xs)
| p li -> BrokenF li
| otherwise -> UnbrokenF x xs
algPair = \case
UnbrokenF x ~(xs, ys) -> (x : xs, ys)
BrokenF ys -> ([], ys)
(hylo
, short for hylomorphism, is simply an ana
followed by a cata
, i.e. an unfold followed by a fold. hylo
, as implemented in recursion-schemes, takes advantage of the fact that the intermediate data structure, created by the unfold and then immediately consumed by the fold, can be fused away, leading to significant performance gains.)
It is worth mentioning that the lazy pattern match in algPair
is crucial to preserve laziness. The Data.List.Extra
implementation linked to above achieves that by using first
from Control.Arrow
, which also matches the pair given to it lazily.
这篇关于在Haskell中实现分裂的一行代码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!