Haskell:列表、数组、向量、序列 [英] Haskell: Lists, Arrays, Vectors, Sequences

查看:12
本文介绍了Haskell:列表、数组、向量、序列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在学习 Haskell,并阅读了几篇关于 Haskell 列表和(插入您的语言)数组的性能差异的文章.

I'm learning Haskell and read a couple of articles regarding performance differences of Haskell lists and (insert your language)'s arrays.

作为一名学习者,我显然只是使用列表,甚至没有考虑性能差异.我最近开始调查并发现 Haskell 中有许多可用的数据结构库.

Being a learner I obviously just use lists without even thinking about performance difference. I recently started investigating and found numerous data structure libraries available in Haskell.

有人可以在不深入了解数据结构的计算机科学理论的情况下解释列表、数组、向量、序列之间的区别吗?

Can someone please explain the difference between Lists, Arrays, Vectors, Sequences without going very deep in computer science theory of data structures?

另外,是否有一些常见的模式,您会使用一种数据结构而不是另一种数据结构?

Also, are there some common patterns where you would use one data structure instead of another?

是否有任何其他形式的数据结构我遗漏了并且可能有用?

Are there any other forms of data structures that I am missing and might be useful?

推荐答案

Lists Rock

到目前为止,Haskell 中对顺序数据最友好的数据结构是 List

Lists Rock

By far the most friendly data structure for sequential data in Haskell is the List

 data [a] = a:[a] | []

列表给你 ϴ(1) 缺点和模式匹配.标准库,以及就此而言的前奏,充满了有用的列表函数,这些函数应该让你的代码乱七八糟(foldr,map,filter).列表是 persistant ,也就是纯函数式的,这非常好.Haskell 列表并不是真正的列表",因为它们是共同归纳的(其他语言称为这些流)所以诸如

Lists give you ϴ(1) cons and pattern matching. The standard library, and for that matter the prelude, is full of useful list functions that should litter your code (foldr,map,filter). Lists are persistant , aka purely functional, which is very nice. Haskell lists aren't really "lists" because they are coinductive (other languages call these streams) so things like

ones :: [Integer]
ones = 1:ones

twos = map (+1) ones

tenTwos = take 10 twos

工作出色.无限的数据结构摇滚.

work wonderfully. Infinite data structures rock.

Haskell 中的列表提供了一个很像命令式语言中的迭代器的接口(因为懒惰).因此,它们被广泛使用是有道理的.

Lists in Haskell provide an interface much like iterators in imperative languages (because of laziness). So, it makes sense that they are widely used.

列表的第一个问题是索引到它们(!!) 需要 ϴ(k) 时间,这很烦人.此外,追加可能很慢 ++,但 Haskell 的惰性求值模型意味着这些可以被视为完全摊销,如果它们发生的话.

The first problem with lists is that to index into them (!!) takes ϴ(k) time, which is annoying. Also, appends can be slow ++, but Haskell's lazy evaluation model means that these can be treated as fully amortized, if they happen at all.

列表的第二个问题是它们的数据局部性很差.当内存中的对象没有彼此相邻布置时,真正的处理器会产生高常量.因此,在 C++ 中,std::vector 比我所知道的任何纯链表数据结构都具有更快的snoc"(将对象放在最后),尽管这不是持久性数据结构,但不如Haskell 的列表.

The second problem with lists is that they have poor data locality. Real processors incur high constants when objects in memory are not laid out next to each other. So, in C++ std::vector has faster "snoc" (putting objects at the end) than any pure linked list data structure I know of, although this is not a persistant data structure so less friendly than Haskell's lists.

列表的第三个问题是它们的空间效率很差.一堆额外的指针会增加你的存储空间(以一个常数因子).

The third problem with lists is that they have poor space efficiency. Bunches of extra pointers push up your storage (by a constant factor).

Data.Sequence 在内部基于 手指树(我知道,你不想知道这个)这意味着它们有一些不错的特性

Data.Sequence is internally based on finger trees (I know, you don't want to know this) which means that they have some nice properties

  1. 纯功能.Data.Sequence 是一个完全持久化的数据结构.
  2. 快速访问树的开头和结尾.ϴ(1)(摊销)以获取第一个或最后一个元素,或附加树.在事物列表最快的地方,Data.Sequence 最多是一个常数慢.
  3. ϴ(log n) 访问序列的中间.这包括插入值以生成新序列
  4. 高质量的 API
  1. Purely functional. Data.Sequence is a fully persistant data structure.
  2. Darn fast access to the beginning and end of the tree. ϴ(1) (amortized) to get the first or last element, or to append trees. At the thing lists are fastest at, Data.Sequence is at most a constant slower.
  3. ϴ(log n) access to the middle of the sequence. This includes inserting values to make new sequences
  4. High quality API

另一方面,Data.Sequence 对数据局部性问题没有太大作用,仅适用于有限集合(它不如列表懒惰)

On the other hand, Data.Sequence doesn't do much for the data locality problem, and only works for finite collections (it is less lazy than lists)

数组是 CS 中最重要的数据结构之一,但它们不太适合懒惰的纯函数世界.数组提供 ϴ(1) 对集合中间的访问和非常好的数据局部性/常数因子.但是,由于它们不太适合 Haskell,因此使用起来很麻烦.当前标准库中实际上有多种不同的数组类型.这些包括完全持久性数组、IO monad 的可变数组、ST monad 的可变数组以及上述的未装箱版本.如需更多信息,请查看 haskell wiki

Arrays are one of the most important data structures in CS, but they dont fit very well with the lazy pure functional world. Arrays provide ϴ(1) access to the middle of the collection and exceptionally good data locality/constant factors. But, since they dont fit very well into Haskell, they are a pain to use. There are actually a multitude of different array types in the current standard library. These include fully persistant arrays, mutable arrays for the IO monad, mutable arrays for the ST monad, and un-boxed versions of the above. For more check out the haskell wiki

Data.Vector 包在更高级别和更清晰的 API 中提供了数组的所有优点.除非你真的知道你在做什么,否则如果你需要类似数组的性能,你应该使用这些.当然,一些警告仍然适用——像数据结构这样的可变数组在纯粹的惰性语言中表现不佳.尽管如此,有时您还是希望获得 O(1) 的性能,而 Data.Vector 会在一个可用的包中为您提供.

The Data.Vector package provides all of the array goodness, in a higher level and cleaner API. Unless you really know what you are doing, you should use these if you need array like performance. Of-course, some caveats still apply--mutable array like data structures just dont play nice in pure lazy languages. Still, sometimes you want that O(1) performance, and Data.Vector gives it to you in a useable package.

如果您只想要能够在末尾有效插入的列表,您可以使用 差异列表.列表搞砸性能的最好例子往往来自 [Char],前奏将其别名为 String.Char 列表很方便,但运行速度往往比 C 字符串慢 20 倍,所以可以随意使用 Data.Text 或非常快的 Data.ByteString.我确定还有其他面向序列的库,我现在没有想到.

If you just want lists with the ability to efficiently insert at the end, you can use a difference list. The best example of lists screwing up performance tends to come from [Char] which the prelude has aliased as String. Char lists are convient, but tend to run on the order of 20 times slower than C strings, so feel free to use Data.Text or the very fast Data.ByteString. I'm sure there are other sequence oriented libraries I'm not thinking of right now.

在 90+% 的情况下,我需要 Haskell 列表中的顺序集合是正确的数据结构.列表就像迭代器,使用列表的函数可以使用它们附带的 toList 函数轻松地与任何其他数据结构一起使用.在一个更好的世界中,前奏将完全参数化它使用的容器类型,但目前 [] 乱扔标准库.因此,(几乎)在任何地方都使用列表绝对没问题.
您可以获得大多数列表函数的完全参数化版本(并且使用它们是高尚的)

90+% of the time I need a sequential collection in Haskell lists are the right data structure. Lists are like iterators, functions that consume lists can easily be used with any of these other data structures using the toList functions they come with. In a better world the prelude would be fully parametric as to what container type it uses, but currently [] litters the standard library. So, using lists (almost) every where is definitely okay.
You can get fully parametric versions of most of the list functions (and are noble to use them)

Prelude.map                --->  Prelude.fmap (works for every Functor)
Prelude.foldr/foldl/etc    --->  Data.Foldable.foldr/foldl/etc
Prelude.sequence           --->  Data.Traversable.sequence
etc

事实上,Data.Traversable 定义了一个 API,该 API 或多或少适用于任何类似列表"的事物.

In fact, Data.Traversable defines an API that is more or less universal across any thing "list like".

不过,尽管您可以很优秀并且只编写完全参数化的代码,但我们大多数人都不是,而是到处使用列表.如果你正在学习,我强烈建议你也这样做.

Still, although you can be good and write only fully parametric code, most of us are not and use list all over the place. If you are learning, I strongly suggest you do too.

根据评论,我意识到我从未解释过何时使用 Data.VectorData.Sequence.数组和向量提供极快的索引和切片操作,但基本上是瞬态(命令式)数据结构.像 Data.Sequence[] 这样的纯函数式数据结构可以有效地从旧值中生成值,就像您修改了旧值一样.

Based on comments I realize I never explained when to use Data.Vector vs Data.Sequence. Arrays and Vectors provide extremely fast indexing and slicing operations, but are fundamentally transient (imperative) data structures. Pure functional data structures like Data.Sequence and [] let efficiently produce new values from old values as if you had modified the old values.

  newList oldList = 7 : drop 5 oldList

不修改旧列表,也不必复制它.所以即使 oldList 非常长,这种修改"也会非常快.同理

doesn't modify old list, and it doesn't have to copy it. So even if oldList is incredibly long, this "modification" will be very fast. Similarly

  newSequence newValue oldSequence = Sequence.update 3000 newValue oldSequence 

将产生一个新的序列,用 newValue for 代替它的 3000 个元素.同样,它不会破坏旧序列,它只是创建一个新序列.但是,它非常有效地执行此操作,取 O(log(min(k,k-n)) 其中 n 是序列的长度,k 是您修改的索引.

will produce a new sequence with a newValue for in the place of its 3000 element. Again, it doesn't destroy the old sequence, it just creates a new one. But, it does this very efficiently, taking O(log(min(k,k-n)) where n is the length of the sequence, and k is the index you modify.

您无法使用 VectorsArrays 轻松做到这一点.它们可以修改,但这是真正的命令式修改,因此不能在常规 Haskell 代码中完成.这意味着 Vector 包中的操作,如 snoccons 进行修改必须复制整个向量,所以取 O(n) 时间.唯一的例外是您可以在 ST monad(或 IO)中使用可变版本(Vector.Mutable)并执行所有操作您的修改就像在命令式语言中一样.完成后,您可以冻结"您的向量,以将其转换为要与纯代码一起使用的不可变结构.

You cant easily do this with Vectors and Arrays. They can be modified but that is real imperative modification, and so cant be done in regular Haskell code. That means operations in the Vector package that make modifications like snoc and cons have to copy the entire vector so take O(n) time. The only exception to this is that you can use the mutable version (Vector.Mutable) inside the ST monad (or IO) and do all your modifications just like you would in an imperative language. When you are done, you "freeze" your vector to turn in into the immutable structure you want to use with pure code.

我的感觉是,如果列表不合适,您应该默认使用 Data.Sequence.仅当您的使用模式不涉及进行大量修改,或者您需要 ST/IO monad 中的极高性能时,才使用 Data.Vector.

My feeling is that you should default to using Data.Sequence if a list is not appropriate. Use Data.Vector only if your usage pattern doesn't involve making many modifications, or if you need extremely high performance within the ST/IO monads.

如果所有关于 ST monad 的讨论让你感到困惑:那就更有理由坚持纯粹的快速和漂亮的 Data.Sequence.

If all this talk of the ST monad is leaving you confused: all the more reason to stick to pure fast and beautiful Data.Sequence.

这篇关于Haskell:列表、数组、向量、序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆