Parsing latex-like language in Java


Question

I'm trying to write a parser in Java for a simple language similar to Latex, i.e. it contains lots of unstructured text with a couple of \commands[with]{some}{parameters} in between. Escape sequences like \\ also have to be taken into account.

I've tried to generate a parser for that with JavaCC, but it looks as if compiler-compilers like JavaCC were only suitable for highly structured code (typical for general-purpose programming languages), not for messy Latex-like markup. So far, it seems I have to go low level and write my own finite state machine.

So my question is, what's the easiest way to parse input that is mostly unstructured, with only a few Latex-like commands in between?

Going low level with a finite state machine is difficult because the Latex commands can be nested, e.g. \cmd1{\cmd2{\cmd3{...}}}

Recommended answer

You can define a grammar to accept the Latex input, using just characters as tokens in the worst case. JavaCC should be just fine for this purpose.

The good thing about a grammar and a parser generator is that it can parse things that FSAs have trouble with, especially nested structures.

A first cut at your grammar could be (I'm not sure this is valid JavaCC, but it is reasonable EBNF):

 Latex = item* ;
 item = command | rawtext ;
 command = commandname arguments ;
 commandname = '\' letter ( letter | digit )* ;  -- might pick this up as lexeme
 letter = 'a' | 'b' | ... | 'z' ;
 digit= '0' | ...  | '9' ;
 arguments =  epsilon |  '{' item* '}' ;
 rawtext = ( letter | digit | whitespace | punctuationminusbackslash )+ ; -- might pick this up as lexeme
 whitespace = ' ' | '\t' | '\n' | '\:0D' ; 
 punctuationminusbackslash = '!' | ... | '^' ;
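If a parser generator feels like too much machinery, the same grammar can also be turned into a small hand-written recursive-descent parser, where the Java call stack takes care of the nesting. The following is only a sketch under some assumptions, not a definitive implementation: the names LatexLikeParser and Node are made up for illustration, a command may take several {...} groups (the grammar above allows at most one), a backslash followed by a non-letter is treated as an escape for that character (so \\ yields a literal backslash), and optional [..] arguments are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Recursive-descent sketch of the grammar above; all names here are hypothetical. */
class LatexLikeParser {
    /** Either raw text (name == null) or a command with zero or more brace-group arguments. */
    static final class Node {
        final String name;                                // command name, null for raw text
        final String text;                                // raw text, null for a command
        final List<List<Node>> args = new ArrayList<>();  // one inner list per {...} group
        Node(String name, String text) { this.name = name; this.text = text; }
    }

    private final String src;
    private int pos = 0;

    private LatexLikeParser(String src) { this.src = src; }

    /** Latex = item* ; a '}' left over at top level is reported as an error. */
    static List<Node> parse(String input) {
        LatexLikeParser p = new LatexLikeParser(input);
        List<Node> items = p.items();
        if (p.pos < input.length()) {
            throw new IllegalArgumentException("Unmatched '}' at offset " + p.pos);
        }
        return items;
    }

    // item* : stops at end of input or in front of a closing brace
    private List<Node> items() {
        List<Node> out = new ArrayList<>();
        while (pos < src.length() && src.charAt(pos) != '}') {
            out.add(isCommandStart() ? command() : rawText());
        }
        return out;
    }

    // A command starts with a backslash followed by a letter.
    // Note: Character.isLetter accepts any Unicode letter, not just 'a'..'z'.
    private boolean isCommandStart() {
        return src.charAt(pos) == '\\'
                && pos + 1 < src.length()
                && Character.isLetter(src.charAt(pos + 1));
    }

    // command = commandname arguments ; (assumption: arguments may repeat)
    private Node command() {
        pos++;                                            // skip the backslash
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        Node cmd = new Node(src.substring(start, pos), null);
        while (pos < src.length() && src.charAt(pos) == '{') {
            pos++;                                        // consume '{'
            cmd.args.add(items());                        // recursion handles nesting
            if (pos >= src.length()) {
                throw new IllegalArgumentException("Missing '}' for \\" + cmd.name);
            }
            pos++;                                        // consume '}'
        }
        return cmd;
    }

    // rawtext, plus the escape rule assumed in the lead-in
    private Node rawText() {
        StringBuilder sb = new StringBuilder();
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == '}') break;
            if (c == '{') throw new IllegalArgumentException("Unexpected '{' at offset " + pos);
            if (c == '\\') {
                if (isCommandStart()) break;
                if (pos + 1 >= src.length()) {
                    throw new IllegalArgumentException("Dangling backslash at end of input");
                }
                sb.append(src.charAt(pos + 1));           // escaped character, taken literally
                pos += 2;
                continue;
            }
            sb.append(c);
            pos++;
        }
        return new Node(null, sb.toString());
    }
}
```

Calling LatexLikeParser.parse on the text a \cmd1{\cmd2{x}}{y} b would give three top-level nodes: raw text, the command cmd1 with two argument groups (the first containing the nested cmd2), and raw text again. Because each {...} group is parsed by a recursive call to items(), no explicit state machine or manual stack is needed for nesting.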
