attoparsec or parsec in Haskell

Question

I have to parse some files and convert them to some predefined datatypes.

Haskell seems to be providing two packages for that:

  1. attoparsec
  2. parsec

What is the difference between the two of them and which one is better suited for parsing a text file according to some rules?

Answer

Parsec

Parsec is good for "user-facing" parsers: cases where the amount of input is bounded but error messages matter. It is not especially fast, but with small inputs that should not matter. For example, I would choose Parsec for virtually any programming-language tooling: in absolute terms even the largest source files are not that big, but error messages really matter.

Parsec can work on different input types, which means you can use it with a standard String or with a stream of tokens from an external lexer of some sort. Since it can use String, it handles Unicode perfectly well for you; the built-in basic parsers like digit and letter are Unicode-aware.
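To make the String-based usage concrete, here is a minimal sketch, assuming the parsec package is available. The `Setting` type, the "key = number" format, and the names `setting` and `parseSetting` are hypothetical and exist only for this illustration.

```haskell
import Text.Parsec (ParseError, char, digit, eof, letter, many1, parse, spaces)
import Text.Parsec.String (Parser)

-- | A hypothetical "key = number" setting, e.g. "retries = 3".
data Setting = Setting String Int
  deriving (Eq, Show)

-- | Parse one setting from a plain String.
setting :: Parser Setting
setting = do
  key <- many1 letter            -- 'letter' is built on Data.Char.isAlpha, so non-ASCII letters are accepted
  spaces *> char '=' *> spaces   -- optional whitespace around '='
  value <- many1 digit           -- one or more digits, so 'read' below cannot fail
  eof                            -- reject trailing input
  pure (Setting key (read value))

-- | Run the parser; "<input>" is the source name shown in error messages.
parseSetting :: String -> Either ParseError Setting
parseSetting = parse setting "<input>"
```

On malformed input such as `"retries = x"`, `parseSetting` returns a `Left` whose `ParseError` carries the line and column of the failure, which is the kind of reporting the answer has in mind.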

Parsec also comes with a monad transformer, which means you can layer it in a monad stack. This could be useful if you want to keep track of additional state during your parse, for example. You could also go for more trippy effects like non-deterministic parsing, or whatever else the usual magic of monad transformers allows.
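The layering described above might look roughly like the following sketch, which assumes the parsec, mtl, and transformers packages. It stacks `ParsecT` on top of `State Int` to count how many numbers were parsed; the names `CountingParser`, `number`, `numbers`, and `parseNumbers` are made up for this example.

```haskell
import Control.Monad.State (State, modify, runState)
import Control.Monad.Trans.Class (lift)
import Text.Parsec (ParseError, ParsecT, char, digit, eof, many1, runParserT, sepBy)

-- | A parser over String with no Parsec user state, layered on State Int.
type CountingParser = ParsecT String () (State Int)

-- | Parse one number and bump the counter held in the underlying State monad.
number :: CountingParser Int
number = do
  digits <- many1 digit
  lift (modify (+ 1))
  pure (read digits)

-- | A comma-separated list of numbers covering the whole input.
numbers :: CountingParser [Int]
numbers = (number `sepBy` char ',') <* eof

-- | Run the parser layer first, then the State layer with the counter at 0.
parseNumbers :: String -> (Either ParseError [Int], Int)
parseNumbers input = runState (runParserT numbers () "<input>" input) 0
```

With this sketch, `parseNumbers "1,22,333"` yields `(Right [1,22,333], 3)`. One design caveat: effects performed in the underlying monad are not undone when the parser backtracks, so a counter like this is only trustworthy for grammars that do not retry the counted step.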

Attoparsec

Attoparsec is much faster than Parsec. You should use it when you expect to get large amounts of input or performance really matters. It's great for things like networking code (parsing packet structure), parsing large amounts of raw data or working with binary file formats.

Attoparsec can work with ByteStrings, which are binary data. This makes it a good choice for implementing things like binary file formats. However, since this is binary data, it does not handle things like text encoding; for that, you should use the attoparsec module for Text.
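As an illustration of the binary use case, here is a small sketch assuming the attoparsec and bytestring packages. The frame layout (three magic bytes, a version byte, a one-byte length, then the payload) is invented for this example and does not correspond to any real format; `Frame`, `frame`, and `parseFrame` are likewise hypothetical names.

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | A hypothetical binary frame: version byte plus raw payload.
data Frame = Frame
  { frameVersion :: Word8
  , framePayload :: B.ByteString
  } deriving (Eq, Show)

-- | Parse: magic bytes "FRM", a version byte, a length byte, then that many payload bytes.
frame :: A.Parser Frame
frame = do
  _ <- A.string (B.pack [0x46, 0x52, 0x4D])  -- the ASCII bytes of "FRM"
  version <- A.anyWord8
  len <- A.anyWord8
  payload <- A.take (fromIntegral len)
  pure (Frame version payload)

-- | Parse a complete frame and require that nothing follows it.
parseFrame :: B.ByteString -> Either String Frame
parseFrame = A.parseOnly (frame <* A.endOfInput)
```

The parser works directly on bytes and never decodes text. For textual input, the `Data.Attoparsec.Text` module offers a similar interface over `Text`.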

Attoparsec supports incremental parsing, which Parsec does not. This is very important for certain applications like networking code, but doesn't matter for others.
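Incremental parsing can be sketched as follows, assuming the attoparsec and bytestring packages. The message format (a decimal number terminated by a newline) and the names `lengthLine`, `feedChunks`, and `parseChunks` are assumptions for the example; in real networking code the chunks would come from a socket rather than a list.

```haskell
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as BC

-- | A hypothetical message: decimal digits terminated by '\n'.
lengthLine :: A.Parser Int
lengthLine = A.decimal <* A.char '\n'

-- | Feed chunks one at a time while the parser still asks for more input.
-- Once the result is Done or Fail, the remaining chunks are ignored.
feedChunks :: A.Result a -> [BC.ByteString] -> A.Result a
feedChunks result@(A.Partial _) (chunk : rest) = feedChunks (A.feed result chunk) rest
feedChunks result _ = result

-- | Start with empty input, supply the chunks, and convert the final result.
-- A result that is still Partial is reported as a Left.
parseChunks :: [BC.ByteString] -> Either String Int
parseChunks chunks = A.eitherResult (feedChunks (A.parse lengthLine BC.empty) chunks)
```

For instance, `parseChunks (map BC.pack ["12", "34\n"])` gives `Right 1234` even though the number is split across two chunks, while `parseChunks [BC.pack "12"]` gives a `Left` because the parser is still waiting for more input.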

Attoparsec has worse error messages than Parsec and sacrifices some high-level features for performance. It's specialized to Text or ByteString, so you can't use it with tokens from a custom lexer. It also isn't a monad transformer.
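For comparison with the Parsec side, here is a rough sketch of the Text interface, assuming the attoparsec and text packages. The "key = number" format and the names `keyValue` and `parseKeyValue` are hypothetical. Note that a failure comes back as a plain `String` with no line or column information; the `<?>` operator can attach a label to make it somewhat more readable.

```haskell
import qualified Data.Attoparsec.Text as A
import Data.Char (isAlpha)
import qualified Data.Text as T

-- | Parse a hypothetical "key = number" line from Text.
keyValue :: A.Parser (T.Text, Int)
keyValue = do
  key <- A.takeWhile1 isAlpha A.<?> "key"      -- label used in failure messages
  A.skipSpace *> A.char '=' *> A.skipSpace     -- optional whitespace around '='
  value <- A.decimal A.<?> "value"
  A.endOfInput                                 -- reject trailing input
  pure (key, value)

-- | Run the parser on complete input; errors are plain strings.
parseKeyValue :: T.Text -> Either String (T.Text, Int)
parseKeyValue = A.parseOnly keyValue
```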

Which One?

Ultimately, Parsec and Attoparsec cater to very different niches. The high-level difference is performance: if you need it, choose Attoparsec; if you don't, just go with Parsec.

My usual heuristic is choosing Parsec for programming languages, configuration file formats and user input as well as almost anything I would otherwise do with a regex. These are things usually produced by hand, so the parsers do not need to scale but they do need to report errors well.

On the other hand, I would choose Attoparsec for things like implementing network protocols, dealing with binary data and file formats, or reading in large amounts of automatically generated data: cases where you're dealing with time constraints or large amounts of data that are usually not written directly by a human.

As you see, the choice is actually often pretty simple: the use cases don't overlap very much. Chances are, it'll be pretty clear which one to use for any given application.
