为什么差异列表比常规连接更高效? [英] Why are difference lists more efficient than regular concatenation?

查看:92
本文介绍了为什么差异列表比常规连接更高效?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在通过在线学习您的haskell 书籍,并且已经例如

 ((((a ++) b)++ c)++ d)++ e)++ f 

据说效率不高。作者提出的解决方案是使用定义为

  newtype的差异列表DiffList a = DiffList {getDiffList :: [ a]  - > [a]} 

实例Monoid(DiffList a)其中
mempty = DiffList(\ xs - > [] ++ xs)
(DiffList f)`mappend` (DiffList g)= DiffList(\ xs - > f(g xs))

我是努力去理解为什么在某些情况下DiffList在计算上比简单的级联更高效。有人能够简单地向我解释为什么上面的例子效率很低,DiffList解决这个问题的方式是什么? 解决方案

中的问题

 ((((a ++ b)++ c)++ d)++ e)+ + f 

是嵌套。 (++)的应用程序是左嵌套的,这很糟糕,右嵌套

  a ++(b ++(c ++(d ++(e ++ f))))

不会成为问题。这是因为(++)被定义为

  [] + + ys = ys 
(x:xs)++ ys = x:(xs ++ ys)

所以要找到使用哪个方程,实现必须深入到表达式树中

 (++)
/ \
(++)f
/ \
(++)e
/ \
(++)d
/ \\
(++)c
/ \
ab



<直到找到左操作数是否为空。如果它不是空的,它的头被取出并冒泡到顶部,但是左操作数的尾部保持不变,所以当需要连接的下一个元素时,同样的过程再次开始。



当连接是右嵌套的时候,(++)的左操作数总是在最上面,并且检查虚空/冒泡头部是O(1)。



但是,当连接是左嵌套的时候, n 深层,到达第一个元素,必须遍历树的 n 节点,对于结果的每个元素(来自第一个列表 n-1 来自第二个等)。



让我们考虑 a =hello

我们要评估 take 5 hi 。所以首先必须检查是否b
$ b

 (((a ++ b)++ c)++ d)++ e 

为空。为此,必须检查是否

 ((a ++ b)++ c)++ d 

为空。为此,必须检查是否

 (a ++ b)++ c 

为空。为此,必须检查是否

  a ++ b 

为空。为此,必须检查是否

  a 

为空。唷。它不是,所以我们可以再次冒出来,装配一下。

  a ++ b ='h':(ello ++ b)
(a ++ b)++ c ='h':((ello++ b)++ c)
((a ++ b)++ c) ++ d ='h':(((ello++ b)++ c)++ d)
(((a ++ b)++ c)++ d)++ e = 'h':(((ello++ b)++ c)++ d)++ e)
((((a ++ b)++ c)++ d)++ e)++ f ='h':((((ello++ b)++ c)++ d)++ e)++ f)

,对于'e',我们必须重复一遍,对于 'l' s ...



绘制树的一部分,冒泡就像这样:



$ $ $
$ $ $ $ $ $ $ $ $ $ $ $ $ h':ellob

变成第一个

 (++)
/ \
(:) c
/ \
'h'(++)
/ \
ello b

然后

 (:) 
/ \
'h'(++)
/ \
(++)c
/ \
ellob

一路回到顶端。最终成为顶层(:)的右侧子树的树的结构与原始树的结构完全相同,除非最左边的列表是空,当

$ $ $ $ $


$ b $

节点会折叠为 b



因此,如果您有短列表的左嵌套连接,则连接变为二次,因为要使连接的头部为O(嵌套深度)操作。通常,左嵌套的

 (...((a_d ++ a_ {d-1}))的连接++ a_ {d-2})...)++ a_2)++ a_1 

O(sum [i * length a_i | i < - [1 .. d]])以完全评估。



使用差异列表(为简化说明,不包含新类型包装),组合是否为左嵌套并不重要

 ((((a ++)。(b ++))。(c ++))。(d ++))。 (e ++)

或者右嵌套。一旦遍历嵌套到达(a ++),那么(++)会被挂起表达式树的顶部,所以得到 a 的每个元素都是O(1)。

实际上, ,只要你需要第一个元素,整个组合就会与差异列表重新关联, )。(b ++))。(c ++))。(d ++))。 (e ++)$ f

变成

< $($ ++ $)$($ ++ $)$($ ++ $)$ b((a ++)。(b ++))。(c ++))$(d ++)((e ++)f)
((a ++)。(b + +))$(c ++)((d ++)((e ++)f))
(a ++)$(b ++)((c ++)((d ++ )((e ++)f)))
a ++(b ++(c ++(d ++(e ++ f))))

之后,每个列表都是顶级(++)的左操作数,在之前的列表已被消耗。



重要的是,前置函数(a ++)可以开始产生结果而不检查它的参数,以便从

 ($)
/ \ (。)(e ++)
/ \
(。)(d ++)
(。)
(。)f
/ \
/ \
(。)(c ++)
/ \
(a ++)(b ++)

通过

 ($)--------- 
/ \
(。)($)$ (。)(d ++)(e ++)f
/ \
(。)(c ++)
/ \
(a ++)(b ++)

 ($)
/ \
(a ++)($)
/ \
(b ++)($)
/ \
(c ++)($)
/ \
(d ++)($)
/ \
(e ++)f

不需要知道任何关于最终列表 f 的组合函数,所以它只是一个 O(深度)重写。然后是顶级

$ $ $
$ a $ b $($) b $ b

变成

  / \ 
a stuff

以及全部可以一步获得 a 的元素。在这个例子中,我们有纯粹的左嵌套,只需要重写一次。如果不是(例如)(d ++),那个地方的函数就是一个左嵌套的组合,(((g ++ )。(h ++))。(i ++))。 (j ++)时,顶层重新关联会使它保持原状,并且当它成为顶层($)在所有以前的列表已被使用之后。



所有重关联所需的全部工作是 O(列表数),所以连接的总成本是 O(列表数+总和(地图长度列表))。 (这意味着你可以通过插入大量的左嵌套([] ++)来带来不好的表现。)





  newtype DiffList a = DiffList {getDiffList :: [a]  - > [a]} 

实例Monoid(DiffList a)其中
mempty = DiffList(\ xs - > [] ++ xs)
(DiffList f)`mappend` (DiffList g)= DiffList(\ xs - > f(g xs))

(b ++)DiffMap(a ++)`mappend` DiffList(b ++) 〜> DiffList((a ++)。(b ++))

请注意,它只对函数有效如果任意函数被包装在 DiffList 中,你不需要检查它们的参数来开始生成输出,你没有这样的效率保证。特别是,附加((++ a),包装或不包含)可以创建左嵌套树(++) DiffList 组合为右嵌套,那么您可以创建 O(n²)连接行为。构造函数被暴露。


I am currently working my way through the Learn you a haskell book online, and have come to a chapter where the author is explaining that some list concatenations can be ineffiecient: For example

((((a ++ b) ++ c) ++ d) ++ e) ++ f

Is supposedly inefficient. The solution the author comes up with is to use 'difference lists' defined as

newtype DiffList a = DiffList {getDiffList :: [a] -> [a] }

instance Monoid (DiffList a) where
    mempty = DiffList (\xs -> [] ++ xs)
    (DiffList f) `mappend` (DiffList g) = DiffList (\xs -> f (g xs))

I am struggling to understand why DiffList is more computationally efficient than a simple concatenation in some cases. Could someone explain to me in simple terms why the above example is so inefficient, and in what way the DiffList solves this problem?

解决方案

The problem in

((((a ++ b) ++ c) ++ d) ++ e) ++ f

is the nesting. The applications of (++) are left-nested, and that's bad, right-nesting

a ++ (b ++ (c ++ (d ++ (e ++f))))

would not be a problem. That is because (++) is defined as

[] ++ ys = ys
(x:xs) ++ ys = x : (xs ++ ys)

so to find which equation to use, the implementation must dive into the expression tree

             (++)
             /  \
          (++)   f
          /  \
       (++)   e
       /  \
    (++)   d
    /  \
 (++)   c
 /  \
a    b

until it finds out whether the left operand is empty or not. If it's not empty, its head is taken and bubbled to the top, but the tail of the left operand is left untouched, so when the next element of the concatenation is demanded, the same procedure starts again.

When the concatenations are right-nested, the left operand of (++) is always at the top, and checking for emptiness/bubbling up the head are O(1).

But when the concatenations are left-nested, n layers deep, to reach the first element, n nodes of the tree must be traversed, for each element of the result (coming from the first list, n-1 for those coming from the second etc.).

Let us consider a = "hello" in

hi = ((((a ++ b) ++ c) ++ d) ++ e) ++ f

and we want to evaluate take 5 hi. So first, it must be checked whether

(((a ++ b) ++ c) ++ d) ++ e

is empty. For that, it must be checked whether

((a ++ b) ++ c) ++ d

is empty. For that, it must be checked whether

(a ++ b) ++ c

is empty. For that, it must be checked whether

a ++ b

is empty. For that, it must be checked whether

a

is empty. Phew. It isn't, so we can bubble up again, assembling

a ++ b                             = 'h':("ello" ++ b)
(a ++ b) ++ c                      = 'h':(("ello" ++ b) ++ c)
((a ++ b) ++ c) ++ d               = 'h':((("ello" ++ b) ++ c) ++ d)
(((a ++ b) ++ c) ++ d) ++ e        = 'h':(((("ello" ++ b) ++ c) ++ d) ++ e)
((((a ++ b) ++ c) ++ d) ++ e) ++ f = 'h':((((("ello" ++ b) ++ c) ++ d) ++ e) ++ f)

and for the 'e', we must repeat, and for the 'l's too...

Drawing a part of the tree, the bubbling up goes like this:

            (++)
            /  \
         (++)   c
         /  \
'h':"ello"   b

becomes first

     (++)
     /  \
   (:)   c
  /   \
'h'   (++)
      /  \
 "ello"   b

and then

      (:)
      / \
    'h' (++)
        /  \
     (++)   c
     /  \
"ello"   b

all the way back to the top. The structure of the tree that becomes the right child of the top-level (:) finally, is exactly the same as the structure of the original tree, unless the leftmost list is empty, when the

 (++)
 /  \
[]   b

nodes is collapsed to just b.

So if you have left-nested concatenations of short lists, the concatenation becomes quadratic because to get the head of the concatenation is an O(nesting-depth) operation. In general, the concatenation of a left-nested

(...((a_d ++ a_{d-1}) ++ a_{d-2}) ...) ++ a_2) ++ a_1

is O(sum [i * length a_i | i <- [1 .. d]]) to evaluate fully.

With difference lists (sans the newtype wrapper for simplicity of exposition), it's not important whether the compositions are left-nested

((((a ++) . (b ++)) . (c ++)) . (d ++)) . (e ++)

or right-nested. Once you have traversed the nesting to reach the (a ++), that (++) is hoisted to the top of the expression tree, so getting at each element of a is again O(1).

In fact, the whole composition is reassociated with difference lists, as soon as you require the first element,

((((a ++) . (b ++)) . (c ++)) . (d ++)) . (e ++) $ f

becomes

((((a ++) . (b ++)) . (c ++)) . (d ++)) $ (e ++) f
(((a ++) . (b ++)) . (c ++)) $ (d ++) ((e ++) f)
((a ++) . (b ++)) $ (c ++) ((d ++) ((e ++) f))
(a ++) $ (b ++) ((c ++) ((d ++) ((e ++) f)))
a ++ (b ++ (c ++ (d ++ (e ++ f))))

and after that, each list is the immediate left operand of the top-level (++) after the preceding list has been consumed.

The important thing in that is that the prepending function (a ++) can start producing its result without inspecting its argument, so that the reassociation from

             ($)
             / \
           (.)  f
           / \
         (.) (e ++)
         / \
       (.) (d ++)
       / \
     (.) (c ++)
     / \
(a ++) (b ++)

via

           ($)---------
           /           \
         (.)           ($)
         / \           / \
       (.) (d ++) (e ++)  f
       / \
     (.) (c ++)
     / \
(a ++) (b ++)

to

     ($)
     / \
(a ++) ($)
       / \
  (b ++) ($)
         / \
    (c ++) ($)
           / \
      (d ++) ($)
             / \
        (e ++)  f

doesn't need to know anything about the composed functions of the final list f, so it's just an O(depth) rewriting. Then the top-level

     ($)
     / \
(a ++)  stuff

becomes

 (++)
 /  \
a    stuff

and all elements of a can be obtained in one step. In this example, where we had pure left-nesting, only one rewriting is necessary. If instead of (for example) (d ++) the function in that place had been a left-nested composition, (((g ++) . (h ++)) . (i ++)) . (j ++), the top-level reassociation would leave that untouched and this would be reassociated when it becomes the left operand of the top-level ($) after all previous lists have been consumed.

The total work needed for all reassociations is O(number of lists), so the overall cost for the concatenation is O(number of lists + sum (map length lists)). (That means you can bring bad performance to this too, by inserting a lot of deeply left-nested ([] ++).)

The

newtype DiffList a = DiffList {getDiffList :: [a] -> [a] }

instance Monoid (DiffList a) where
    mempty = DiffList (\xs -> [] ++ xs)
    (DiffList f) `mappend` (DiffList g) = DiffList (\xs -> f (g xs))

just wraps that so that it is more convenient to handle abstractly.

DiffList (a ++) `mappend` DiffList (b ++) ~> DiffList ((a ++) . (b++))

Note that it is only efficient for functions that don't need to inspect their argument to start producing output, if arbitrary functions are wrapped in DiffLists, you have no such efficiency guarantees. In particular, appending ((++ a), wrapped or not) can create left-nested trees of (++) when composed right-nested, so you can create the O(n²) concatenation behaviour with that if the DiffList constructor is exposed.

这篇关于为什么差异列表比常规连接更高效?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆