Lexers / parsers for (un)structured text documents

Question

There are lots of parsers and lexers for scripts (i.e. structured computer languages). But I'm looking for one which can break an (almost) unstructured text document into larger sections, e.g. chapters, paragraphs, etc.

It's relatively easy for a person to identify them (where the Table of Contents, the acknowledgements, or the main body starts), and it is possible to build rule-based systems to identify some of these (such as paragraphs).

I don't expect it to be perfect, but does anyone know of such a broad 'block-based' lexer / parser? Or could you point me in the direction of literature which may help?

Recommended answer

Many lightweight markup languages like Markdown (which incidentally SO uses), reStructuredText and (arguably) POD are similar to what you're talking about. They have minimal syntax and break input down into parseable syntactic pieces. You might be able to get some information by reading about their implementations.
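To give a rough idea of what such a block-level first pass might look like, here is a minimal Python sketch. It is not taken from any of the implementations mentioned above; it only assumes that blocks are separated by blank lines and that a single-line block matching a small, hypothetical list of keywords ("Chapter N", "Table of Contents", "Acknowledgements") is a heading. The names `HEADING`, `split_blocks`, `classify` and `parse` are made up for illustration.

```python
import re

# Hypothetical heading rule (an assumption, not from any real markup spec):
# a line starting with "Chapter <word>", "Table of Contents", "Contents",
# or "Acknowledg(e)ments".
HEADING = re.compile(
    r"^(chapter\s+\w+|table of contents|contents|acknowledge?ments)\b",
    re.IGNORECASE,
)


def split_blocks(text):
    """Split text into blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [b.strip() for b in blocks if b.strip()]


def classify(block):
    """Label a block 'heading' or 'paragraph' using the simple rule above."""
    lines = block.splitlines()
    if len(lines) == 1 and HEADING.match(lines[0].strip()):
        return "heading"
    return "paragraph"


def parse(text):
    """Return a list of (label, block_text) pairs in document order."""
    return [(classify(b), b) for b in split_blocks(text)]


if __name__ == "__main__":
    sample = (
        "Table of Contents\n\n"
        "Chapter 1\n\n"
        "It was a dark night.\nThe rain fell.\n\n\n"
        "Chapter 2\n\n"
        "Morning came."
    )
    for label, block in parse(sample):
        print(label, "|", block.replace("\n", " "))
```

A real document would need many more rules (numbered headings, indented paragraphs, page headers), but the two-stage shape, splitting into blocks first and then classifying each block, is the part that carries over.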
