How to Lex, Parse, and Serialize-to-XML Email Messages using Alex and Happy


Question

I am working toward being able to input any email message and output an equivalent XML encoding.

I am starting small, with one of the email headers: the "From Header".

Here is an example of a From Header:

From: John Doe <john@doe.org>

I want it transformed into this XML:

<From>
    <Mailbox>
        <DisplayName>John Doe</DisplayName>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

I want to use the lexical analyzer "Alex" (http://www.haskell.org/alex/doc/html/) to break apart (tokenize) the From Header.

I want to use the parser "Happy" (http://www.haskell.org/happy/) to process the tokens and generate a parse tree.

Then I want to use a serializer to walk the parse tree and output XML.

The format of the From Header is specified by the Internet Message Format (IMF), RFC 5322 (http://tools.ietf.org/html/rfc5322).

Here are a few more examples of From Headers and the desired XML output:

From Header with no display name:

From: <john@doe.org>

Desired XML output:

<From>
    <Mailbox>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

From Header with no display name and no angle brackets around the address:

From: john@doe.org

Desired XML output:

<From>
    <Mailbox>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

From Header with multiple mailboxes, each separated by a comma:

From: <john@doe.org>, "Simon St. John" <simon@stjohn.org>, sally@smith.org

Desired XML output:

<From>
    <Mailbox>
        <Address>john@doe.org</Address>
    </Mailbox>
    <Mailbox>
        <DisplayName>Simon St. John</DisplayName>
        <Address>simon@stjohn.org</Address>
    </Mailbox>
    <Mailbox>
        <Address>sally@smith.org</Address>
    </Mailbox>
</From>

RFC 5322 says that the syntax for a comment is: ( … ). Here is a From Header containing a comment:

From: (this is a comment) "John Doe" <john@doe.org>

I want all comments removed during lexing.

The desired XML output is this:

<From>
    <Mailbox>
        <DisplayName>John Doe</DisplayName>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

The RFC says that there can be "folding whitespace" scattered throughout the From Header. Here is a From Header with the From: token on the first line, the display name on the second line, and the address on the third line:

From: 
    "John Doe" 
    <john@doe.org>

The XML output should not be affected by the folding whitespace:

<From>
    <Mailbox>
        <DisplayName>John Doe</DisplayName>
        <Address>john@doe.org</Address>
    </Mailbox>
</From>

The RFC says that the part after the @ character in the address can be a string enclosed in square brackets (a domain literal), such as this:

From: "John Doe" <john@[website]>

I must admit that I have never seen emails with that. Nonetheless, the RFC says it is allowed, so I certainly want my lexer and parser to handle such inputs. Here is the desired output:

<From>
    <Mailbox>
        <DisplayName>John Doe</DisplayName>
        <Address>john@[website]</Address>
    </Mailbox>
</From>

Error Handling

I want an error generated if the From Header is incorrect. Here are a couple of examples of erroneous From Headers and the desired output:

The display name is erroneously placed after the address:

From: <john@doe.org> "John Doe"

The output should specify the location where the error was discovered:

serialize: parse error at line 1 and column 22. Error occurred at "John Doe"

This From Header has an erroneous "23" before the display name:

From: 23 "John Doe" <john@doe.org>

Again, the output should specify the location where the error was discovered:

serialize: parse error at line 1 and column 10. Error occurred at "John Doe"

Would you please show how to implement the lexer, parser, and serializer?

Recommended Answer

Divide the task into five steps:

Step #1: specify the complete, authoritative BNF for the From Header

Step #2: create a lexical analysis function, lex, that breaks the From Header into a sequence of small chunks, such as from:, displayName, angleAddress, and so on. These small chunks are called tokens

lex :: String -> [Token]

Step #3: define a data type, From, to represent the From Header

Step #4: create a parser function, parse, that consumes the sequence of tokens and produces a parse tree of type From

parse :: [Token] -> From

Step #5: create a function, serialize, that walks the parse tree and generates XML

serialize :: From -> XML
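
As a rough sketch of how the three stages fit together (this is not taken from the linked source files), the whole pipeline is just function composition. The name fromHeaderToXml is hypothetical, and the stages are passed in as arguments so that the sketch compiles on its own. One practical point: the Prelude already exports a function called lex, so a module that defines its own lex would need "import Prelude hiding (lex)".

```haskell
-- Compose the three stages of the pipeline.
-- The stage functions are parameters, so this sketch does not depend
-- on the real Alex lexer or Happy parser (assumption: XML is a String).
fromHeaderToXml
  :: (String -> [token])  -- lexer, playing the role of lex
  -> ([token] -> tree)    -- parser, playing the role of parse
  -> (tree -> String)     -- serializer, playing the role of serialize
  -> String               -- raw From Header
  -> String               -- XML text
fromHeaderToXml lexer parser serializer = serializer . parser . lexer
```

With the real stages in scope, the call would look like fromHeaderToXml lex parse serialize header.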


Step #1: specify the complete, authoritative BNF for the data format

The complete, authoritative BNF for the From header is specified in RFC 5322. I extracted the portions applicable to the From header:

http://www.xfront.com/parsing/RFC-5322/From-Header/From-Header-BNF.pdf

Step #2: create a lexical analysis function, lex, that breaks the From Header into tokens

Here is an example that shows how From Headers will be tokenized:

Tokenize this From Header:

From: "John Doe" <john@doe.org>

The output of the lexer is this list of tokens:

[ 
  TokenFrom (AlexPn 0 1 1)
  , TokenDisplayName (AlexPn 6 1 7) "\"John Doe\""
  , TokenAngleAddress (AlexPn 17 1 18) "<john@doe.org>"
]

Each item in the list consists of a label for the token, position information, and then optionally a value. The position information is the part in parentheses. "AlexPn" is the constructor that marks it as position information. The three numbers that follow give the location of the token: absolute character offset (starting at 0), line number, and column number (both starting at 1).
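
The token list above suggests a token type along the following lines. This is a sketch inferred from the sample output, not a copy of the linked lexer: the AlexPosn declaration mirrors the one that Alex's "posn" wrapper generates, while the helper names tokenPosn and describePosn are hypothetical, and the real lexer presumably has more token constructors than the three shown.

```haskell
-- Position type as produced by Alex's "posn" wrapper:
-- absolute character offset, line number, column number.
data AlexPosn = AlexPn !Int !Int !Int
  deriving (Eq, Show)

-- Assumed token type, reconstructed from the sample lexer output.
data Token
  = TokenFrom AlexPosn
  | TokenDisplayName AlexPosn String
  | TokenAngleAddress AlexPosn String
  deriving (Eq, Show)

-- Extract the position carried by any token.
tokenPosn :: Token -> AlexPosn
tokenPosn (TokenFrom p)           = p
tokenPosn (TokenDisplayName p _)  = p
tokenPosn (TokenAngleAddress p _) = p

-- Render a position in the style of the desired error messages.
describePosn :: AlexPosn -> String
describePosn (AlexPn _ line col) =
  "line " ++ show line ++ " and column " ++ show col
```

Carrying the position in every token is what lets the parser report where an error was discovered.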

Below is the lexer for the BNF. Observe the one-to-one mapping between the BNF and the token definitions. For example, the BNF has this production rule:

qcontent  = ( qtext  |  quoted-pair )

The lexer has this token definition:

@qcontent = ( $qtext | @quoted_pair )

Aside from minor syntactic differences, they are identical. That is really powerful. Assuming the definition of the email "From header" is correct (i.e., the BNF is correct), then we can be pretty certain that the lexer will be correct.
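
To make the production concrete without reproducing the Alex file, the same rule can be mirrored in plain Haskell. This is only an illustration: the names isQtext, isQuotable, and qcontent are hypothetical, the character ranges are the ones RFC 5322 gives (qtext = %d33 / %d35-91 / %d93-126, and a quoted-pair is a backslash followed by a visible character or whitespace), the obsolete forms are ignored, and the real lexer expresses all of this as Alex macros rather than functions.

```haskell
import Data.Char (ord)

-- qtext: printable ASCII except the double quote (34) and backslash (92).
isQtext :: Char -> Bool
isQtext c = n == 33 || (n >= 35 && n <= 91) || (n >= 93 && n <= 126)
  where
    n = ord c

-- Characters allowed after the backslash of a quoted-pair:
-- VCHAR (33..126) or WSP (space, horizontal tab).
isQuotable :: Char -> Bool
isQuotable c = (n >= 33 && n <= 126) || c == ' ' || c == '\t'
  where
    n = ord c

-- qcontent = qtext | quoted-pair
-- Consume one qcontent from the front of the input and return the
-- character it denotes together with the remaining input.
qcontent :: String -> Maybe (Char, String)
qcontent ('\\' : c : rest)
  | isQuotable c = Just (c, rest)
qcontent (c : rest)
  | isQtext c = Just (c, rest)
qcontent _ = Nothing
```

The two equations of qcontent correspond to the two alternatives of the production, which is the same correspondence the Alex macro gives.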

Here is the lexer:

http://www.xfront.com/parsing/RFC-5322/From-Header/Lexer.x.txt

Step #3: define a data type, From, to represent the From Header

The parse tree will be represented internally using this From data type:

data From
    = From MailboxList
    deriving Show

type MailboxList
    = [ Mailbox ]

data Mailbox
    = LongMailbox DisplayName AngleAddress
    | AngleMailbox AngleAddress
    | ShortMailbox AddressSpecification
    deriving Show

data DisplayName
    = DisplayName String
    deriving Show

data AngleAddress
    = AngleAddress String
    deriving Show

data AddressSpecification
    = AddressSpecification String
    deriving Show
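
For illustration, here is how the multiple-mailbox header from the question might look as a value of this type. The mapping of the three header forms onto the three Mailbox constructors is my reading of the constructor names, and exampleFrom and mailboxAddress are hypothetical names, so treat this as a sketch rather than the parser's confirmed output.

```haskell
-- The data types from the answer, repeated so the example stands alone.
data From = From MailboxList deriving Show
type MailboxList = [Mailbox]
data Mailbox
  = LongMailbox DisplayName AngleAddress
  | AngleMailbox AngleAddress
  | ShortMailbox AddressSpecification
  deriving Show
data DisplayName = DisplayName String deriving Show
data AngleAddress = AngleAddress String deriving Show
data AddressSpecification = AddressSpecification String deriving Show

-- From: <john@doe.org>, "Simon St. John" <simon@stjohn.org>, sally@smith.org
exampleFrom :: From
exampleFrom = From
  [ AngleMailbox (AngleAddress "john@doe.org")            -- <addr> only
  , LongMailbox (DisplayName "Simon St. John")
                (AngleAddress "simon@stjohn.org")         -- name plus <addr>
  , ShortMailbox (AddressSpecification "sally@smith.org") -- bare addr
  ]

-- Every constructor carries exactly one address; this helper extracts it.
mailboxAddress :: Mailbox -> String
mailboxAddress (LongMailbox _ (AngleAddress a))         = a
mailboxAddress (AngleMailbox (AngleAddress a))          = a
mailboxAddress (ShortMailbox (AddressSpecification a))  = a
```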

Step #4: create a parser -- consume the sequence of tokens and produce a parse tree of type "From"

Here is an example that shows how From Headers will be parsed:

Parse this From Header:

From: "John Doe" <john@doe.org>

The output of the parser is this parse tree:

From 
    [
        LongMailbox 
            (DisplayName "John Doe") 
            (AngleAddress "john@doe.org")
    ]
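
To show what the generated parser has to do, here is a small hand-written equivalent. It is not the Happy grammar from the linked file and it swaps in plain recursive functions for the generated parser. The constructors TokenComma and TokenAddrSpec and the function names parseFrom, mailboxList, mailbox, and parseError are assumptions made for this sketch, and errors are returned in an Either instead of being raised.

```haskell
data AlexPosn = AlexPn !Int !Int !Int deriving (Eq, Show)

data Token
  = TokenFrom AlexPosn
  | TokenComma AlexPosn               -- hypothetical
  | TokenDisplayName AlexPosn String
  | TokenAngleAddress AlexPosn String
  | TokenAddrSpec AlexPosn String     -- hypothetical
  deriving (Eq, Show)

data From = From [Mailbox] deriving (Eq, Show)
data Mailbox
  = LongMailbox DisplayName AngleAddress
  | AngleMailbox AngleAddress
  | ShortMailbox AddressSpecification
  deriving (Eq, Show)
data DisplayName = DisplayName String deriving (Eq, Show)
data AngleAddress = AngleAddress String deriving (Eq, Show)
data AddressSpecification = AddressSpecification String deriving (Eq, Show)

-- from = "From:" mailbox-list
parseFrom :: [Token] -> Either String From
parseFrom (TokenFrom _ : rest) = From <$> mailboxList rest
parseFrom ts                   = Left (parseError ts)

-- mailbox-list = mailbox *("," mailbox)
mailboxList :: [Token] -> Either String [Mailbox]
mailboxList ts = do
  (m, rest) <- mailbox ts
  case rest of
    []                  -> Right [m]
    TokenComma _ : more -> (m :) <$> mailboxList more
    _                   -> Left (parseError rest)

-- mailbox = display-name angle-addr | angle-addr | addr-spec
mailbox :: [Token] -> Either String (Mailbox, [Token])
mailbox (TokenDisplayName _ n : TokenAngleAddress _ a : rest) =
  Right (LongMailbox (DisplayName (strip '"' '"' n)) (AngleAddress (strip '<' '>' a)), rest)
mailbox (TokenAngleAddress _ a : rest) =
  Right (AngleMailbox (AngleAddress (strip '<' '>' a)), rest)
mailbox (TokenAddrSpec _ a : rest) =
  Right (ShortMailbox (AddressSpecification a), rest)
mailbox ts = Left (parseError ts)

-- Remove a matching pair of delimiters, if both are present.
strip :: Char -> Char -> String -> String
strip open close (c : s)
  | c == open, not (null s), last s == close = init s
strip _ _ s = s

-- Report the position and text of the first token that could not be used.
parseError :: [Token] -> String
parseError []      = "parse error at end of input"
parseError (t : _) =
  "parse error at line " ++ show line ++ " and column " ++ show col
    ++ ". Error occurred at " ++ text
  where
    (AlexPn _ line col, text) = case t of
      TokenFrom p           -> (p, "From:")
      TokenComma p          -> (p, ",")
      TokenDisplayName p s  -> (p, s)
      TokenAngleAddress p s -> (p, s)
      TokenAddrSpec p s     -> (p, s)
```

Fed the tokens for the header with the display name after the address, this sketch stops at the display-name token and reports line 1 and column 22, which matches the desired error message apart from the "serialize:" prefix.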

Here is the parser:

http://www.xfront.com/parsing/RFC-5322/From-Header/Parser.y.txt

Step #5: create a function, serialize, that walks the parse tree and generates XML

The serializer has a function for every grammar production. For example, here is the function for the From grammar production:

serialize :: From -> String
serialize (From mailboxList) = "<From>" ++ serializeMailboxList mailboxList ++ "</From>"

The function's argument is the root of the parse tree, which has the label From. The function calls another function, serializeMailboxList, to process the children of the root. The result is wrapped in a <From> start-tag and end-tag pair.
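
The remaining functions might look roughly like this. Only serialize and the name serializeMailboxList come from the answer; serializeMailbox, element, and escape are hypothetical helpers, and the linked file may be organized differently. The sketch emits the XML on a single line with no indentation, and it escapes the characters that are special in XML text content.

```haskell
data From = From [Mailbox] deriving Show
data Mailbox
  = LongMailbox DisplayName AngleAddress
  | AngleMailbox AngleAddress
  | ShortMailbox AddressSpecification
  deriving Show
data DisplayName = DisplayName String deriving Show
data AngleAddress = AngleAddress String deriving Show
data AddressSpecification = AddressSpecification String deriving Show

-- Root production: wrap the serialized children in <From>...</From>.
serialize :: From -> String
serialize (From mailboxList) =
  "<From>" ++ serializeMailboxList mailboxList ++ "</From>"

-- One <Mailbox> element per entry, in order.
serializeMailboxList :: [Mailbox] -> String
serializeMailboxList = concatMap serializeMailbox

-- All three mailbox forms produce an <Address>; only the long form
-- also produces a <DisplayName>.
serializeMailbox :: Mailbox -> String
serializeMailbox mb = "<Mailbox>" ++ body ++ "</Mailbox>"
  where
    body = case mb of
      LongMailbox (DisplayName n) (AngleAddress a) ->
        element "DisplayName" n ++ element "Address" a
      AngleMailbox (AngleAddress a)         -> element "Address" a
      ShortMailbox (AddressSpecification a) -> element "Address" a

-- Build <tag>text</tag>, escaping the text.
element :: String -> String -> String
element tag text = "<" ++ tag ++ ">" ++ escape text ++ "</" ++ tag ++ ">"

-- Escape characters that are special in XML text content.
escape :: String -> String
escape = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c   = [c]
```

Because each function handles exactly one production, adding a new kind of mailbox to the grammar would mean adding one constructor and one case alternative here.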

Here is the XML serializer:

http://www.xfront.com/parsing/RFC-5322/From-Header/serialize.hs.txt
