Implementing a parser for a markdown-like language
Question
I have a markup language which is similar to Markdown and the one used by SO.
The legacy parser was based on regexes and was a complete nightmare to maintain, so I've come up with my own solution based on an EBNF grammar, implemented via mxTextTools/SimpleParse.
However, there are issues with some tokens which may include each other, and I don't see a 'right' way to handle it.
Here is part of my grammar:
newline := "\r\n"/"\n"/"\r"
indent := ("\r\n"/"\n"/"\r"), [ \t]
number := [0-9]+
whitespace := [ \t]+
symbol_mark := [*_>#`%]
symbol_mark_noa := [_>#`%]
symbol_mark_nou := [*>#`%]
symbol_mark_nop := [*_>#`]
punctuation := [\(\)\,\.\!\?]
noaccent_code := -(newline / '`')+
accent_code := -(newline / '``')+
symbol := -(whitespace / newline)
text := -newline+
safe_text := -(newline / whitespace / [*_>#`] / '%%' / punctuation)+/whitespace
link := 'http' / 'ftp', 's'?, '://', (-[ \t\r\n<>`^'"*\,\.\!\?]/([,\.\?],?-[ \t\r\n<>`^'"*]))+
strikedout := -[ \t\r\n*_>#`^]+
ctrlw := '^W'+
ctrlh := '^H'+
strikeout := (strikedout, (whitespace, strikedout)*, ctrlw) / (strikedout, ctrlh)
strong := ('**', (inline_nostrong/symbol), (inline_safe_nostrong/symbol_mark_noa)* , '**') / ('__' , (inline_nostrong/symbol), (inline_safe_nostrong/symbol_mark_nou)*, '__')
emphasis := ('*',?-'*', (inline_noast/symbol), (inline_safe_noast/symbol_mark_noa)*, '*') / ('_',?-'_', (inline_nound/symbol), (inline_safe_nound/symbol_mark_nou)*, '_')
inline_code := ('`' , noaccent_code , '`') / ('``' , accent_code , '``')
inline_spoiler := ('%%', (inline_nospoiler/symbol), (inline_safe_nop/symbol_mark_nop)*, '%%')
inline := (inline_code / inline_spoiler / strikeout / strong / emphasis / link)
inline_nostrong := (?-('**'/'__'),(inline_code / reference / signature / inline_spoiler / strikeout / emphasis / link))
inline_nospoiler := (?-'%%',(inline_code / emphasis / strikeout / emphasis / link))
inline_noast := (?-'*',(inline_code / inline_spoiler / strikeout / strong / link))
inline_nound := (?-'_',(inline_code / inline_spoiler / strikeout / strong / link))
inline_safe := (inline_code / inline_spoiler / strikeout / strong / emphasis / link / safe_text / punctuation)+
inline_safe_nostrong := (?-('**'/'__'),(inline_code / inline_spoiler / strikeout / emphasis / link / safe_text / punctuation))+
inline_safe_noast := (?-'*',(inline_code / inline_spoiler / strikeout / strong / link / safe_text / punctuation))+
inline_safe_nound := (?-'_',(inline_code / inline_spoiler / strikeout / strong / link / safe_text / punctuation))+
inline_safe_nop := (?-'%%',(inline_code / emphasis / strikeout / strong / link / safe_text / punctuation))+
inline_full := (inline_code / inline_spoiler / strikeout / strong / emphasis / link / safe_text / punctuation / symbol_mark / text)+
line := newline, ?-[ \t], inline_full?
sub_cite := whitespace?, ?-reference, '>'
cite := newline, whitespace?, '>', sub_cite*, inline_full?
code := newline, [ \t], [ \t], [ \t], [ \t], text
block_cite := cite+
block_code := code+
all := (block_cite / block_code / line / code)+
The first problem is that spoiler, strong and emphasis can include each other in arbitrary order, and it's possible that later I'll need more such inline markups.
My current solution involves creating a separate token for each combination (inline_noast, inline_nostrong, etc.), but obviously the number of such combinations grows too fast as the number of markup elements grows.
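Until the grammar is restructured, one stopgap for the maintenance burden is to generate the per-marker exclusion productions mechanically rather than hand-writing each one. The sketch below is hypothetical (the marker set and the alternatives list are illustrative, mirroring the inline_no* rules above, not the full grammar):

```python
# hypothetical helper: generate the per-marker exclusion rules
# (mirrors the hand-written inline_no* productions above)
markers = {'strong': '**', 'emphasis': '*', 'spoiler': '%%'}

def exclusion_rule(excluded):
    """Build an inline_no<name> production that forbids re-entering
    the excluded marker but allows every other inline construct."""
    others = [name for name in markers if name != excluded]
    alts = ' / '.join(['inline_code'] + others + ['strikeout', 'link'])
    return "inline_no%s := (?-'%s',(%s))" % (
        excluded, markers[excluded], alts)

for name in markers:
    print(exclusion_rule(name))
```

This keeps the number of hand-maintained rules constant as markers are added, though it does not fix the underlying combinatorial structure of the grammar.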
The second problem is that these lookaheads in strong/emphasis behave VERY poorly on some cases of bad markup like __._.__*__.__...___._.____.__**___*** (lots of randomly placed markup symbols). It takes minutes to parse a few KB of such random text.
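One way to see why such inputs are pathological: each run of n identical marker characters can be segmented into single-character (emphasis) and double-character (strong) delimiters in Fibonacci-many ways, and a backtracking parser may explore all of them before failing. A quick back-of-the-envelope sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def segmentations(n):
    """Number of ways to split a run of n '*' characters into
    '*' (emphasis) and '**' (strong) delimiter tokens."""
    if n < 0:
        return 0
    if n == 0:
        return 1
    # last token is either one '*' or two '*'
    return segmentations(n - 1) + segmentations(n - 2)

print(segmentations(10))  # 89
print(segmentations(30))  # 1346269
```

So even a modest run of thirty asterisks offers over a million candidate tokenizations, which is consistent with the minutes-long parse times observed on random marker soup.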
Is there something wrong with my grammar, or should I use some other kind of parser for this task?
Answer
If one thing includes another, then normally you treat them as separate tokens and then nest them in the grammar. Lepl (http://www.acooke.org/lepl, which I wrote) and PyParsing (which is probably the most popular pure-Python parser) both allow you to nest things recursively.
So in Lepl you could write code something like:
# these are tokens (defined as regexps)
stg_marker = Token(r'\*\*')
emp_marker = Token(r'\*') # tokens are longest match, so strong is preferred if possible
spo_marker = Token(r'%%')
....
# grammar rules combine tokens
contents = Delayed() # this will be defined later and lets us recurse
strong = stg_marker + contents + stg_marker
emphasis = emp_marker + contents + emp_marker
spoiler = spo_marker + contents + spo_marker
other_stuff = .....
contents += strong | emphasis | spoiler | other_stuff # this defines contents recursively
Then you can see, I hope, how contents will match nested uses of strong, emphasis, etc.
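The same idea of a single recursive contents rule can be sketched in plain Python without any parser library (this is an illustrative toy, not Lepl or the original grammar): one function parses until its closing marker, recursing whenever it meets an opening marker, so arbitrary nesting falls out for free.

```python
# toy recursive-descent sketch of a single recursive `contents` rule;
# marker set and node shapes are illustrative
MARKERS = {'**': 'strong', '*': 'emphasis', '%%': 'spoiler'}

def parse_contents(text, pos=0, closing=None):
    """Parse until `closing` marker (or end of text); return (nodes, pos)."""
    nodes = []
    while pos < len(text):
        if closing and text.startswith(closing, pos):
            return nodes, pos + len(closing)
        # longest match first, so '**' is preferred over '*'
        for mark in sorted(MARKERS, key=len, reverse=True):
            if text.startswith(mark, pos):
                inner, pos = parse_contents(text, pos + len(mark), closing=mark)
                nodes.append((MARKERS[mark], inner))
                break
        else:
            # plain text runs until the next marker
            end = pos + 1
            while end < len(text) and not any(
                    text.startswith(m, end) for m in MARKERS):
                end += 1
            nodes.append(('text', text[pos:end]))
            pos = end
    if closing:
        raise ValueError('unclosed %r' % closing)
    return nodes, pos

tree, _ = parse_contents('**bold %%and *nested*%% text**')
```

Because nesting lives in the recursion rather than in per-combination tokens, adding a new marker is one dictionary entry instead of a new family of exclusion rules.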
There's much more than this to do for your final solution, and efficiency could be an issue in any pure-Python parser. (There are some parsers implemented in C that are callable from Python; these will be faster, but may be trickier to use. I can't recommend any because I haven't used them.)