Haskell 中的高效字符串实现 [英] Efficient String Implementation in Haskell

查看:23
本文介绍了Haskell 中的高效字符串实现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在自学 Haskell,我想知道在 Haskell 中处理字符串的最佳实践是什么.

I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.

Haskell 中默认的字符串实现是一个 Char 列表.根据 Real World Haskell,因为每个字符都是单独分配的(我认为这意味着 String 基本上是 Haskell 中的链表,但我不确定.)

The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)

但是如果默认的字符串实现对于文件 i/o 是低效的,那么在内存中处理字符串也是低效的吗?为什么或者为什么不?C 使用一个字符数组来表示一个字符串,我认为这将是大多数语言的默认处理方式.

But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.

在我看来,String 的列表实现将占用更多内存,因为每个字符都需要开销,并且需要更多时间来迭代,因为需要取消引用指针才能到达下一个字符.但到目前为止我一直喜欢使用 Haskell,所以我想相信默认实现是有效的.

As I see it, the list implementation of String will take up more memory, since each character will require overhead, and also more time to iterate over, because a pointer dereferencing will be required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.

推荐答案

在 Haskell 中高效处理字符串的最佳实践基本上是:使用 Data.ByteString/Data.ByteString.Lazy.

Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.

http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/

就 Haskell 中默认字符串实现的效率而言,事实并非如此.每个 Char 代表一个 Unicode 代码点,这意味着每个 Char 至少需要 21 位.

As far as the efficiency of the default string implementation goes in Haskell, it's not. Each Char represents a Unicode codepoint which means it needs at least 21bits per Char.

由于一个String就是[Char],即Char的链表,所以表示String>s 的引用局部性很差,再次意味着 Strings 在内存中相当大,至少它是 N * (21bits + Mbits) 其中 N 是长度字符串的大小,M 是指针的大小(32、64,你有什么),与 Haskell 使用列表的许多其他地方不同,其他语言可能使用不同的结构(我在这里专门考虑控制流),String 不太可能被编译器优化为循环等.

Since a String is just [Char], that is a linked list of Char, it means Strings have poor locality of reference, and again means that Strings are fairly large in memory, at a minimum it's N * (21bits + Mbits) where N is the length of the string and M is the size of a pointer (32, 64, what have you) and unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be able to be optimized to loops, etc. by the compiler.

虽然 Char 对应于代码点,但 Haskell 98 报告没有指定任何有关执行文件 IO 时使用的编码的信息,甚至没有指定任何默认值,更不用说更改它的方法了.在实践中,GHC 提供了一个扩展来做,例如二进制 IO,但无论如何您此时都会取消预订.

And while a Char corresponds to a codepoint, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO, not even a default much less a way to change it. In practice GHC provides an extensions to do e.g. binary IO, but you're going off the reservation at that point anyway.

即使使用诸如在字符串前面添加这样的操作,String 在实践中也不太可能击败 ByteString.

Even with operations like prepending to front of the string it's unlikely that a String will beat a ByteString in practice.

这篇关于Haskell 中的高效字符串实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆