Regular Expression Compiler

Question

I have had the need to use regular expressions only a few times in the work that I have done. However, in those few times I discovered a very powerful form of expression that would enable me to do some extremely useful things.

The problem is that the language used for regular expressions is wrong - full stop.

It is wrong from a psychological point of view - using disembodied symbols provides a useful reference only to those with an eidetic memory. Whilst the syntactic rules are clearly laid out, from my experience and what I have learnt from others, evolving a regular expression that functions successfully can prove to be a difficult thing to do in all but the most trivial situations. This is understandable since it is a symbolic analog for set theory, which is a fairly complicated thing.

One of the things that can prove difficult is breaking the expression you are working on down into its discrete parts. Because of the nature of the language, a single regular expression can be read in multiple ways if you don't understand its primary goal, so interpreting other people's regexes is complicated. In natural language study I believe this is called pragmatics.

The question I'd like to ask then is this - is there such a thing as a regular expression compiler? Or can one even be built?

It could be possible to consider regexes, from a metaphorical point of view, as assembly language - there are some similarities. Could a compiler be designed that could turn a more natural language - a higher language - into regular expressions? Then in my code, I could define my regexes using the higher level language in a header file and reference them where necessary using a symbolic reference. I and others could refer from my code to the header file and more easily appreciate what I am trying to achieve with my regexes.

I know it can be done from a logical point of view, otherwise computers wouldn't be possible. But if you have read this far, would you consider investing the time in realising it?

Recommended answer

1) Perl permits the /x switch on regular expressions to enable comments and whitespace to be included inside the regex itself. This makes it possible to spread a complex regex over several lines, using indentation to indicate block structure.
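
As a rough sketch of what point 1 looks like in practice, here is the quoted-string pattern used later in this answer, written with /x so that each part sits on its own commented line. The variable name $quoted_string and the sample inputs are illustrative choices, not part of the original answer.

```perl
use strict;
use warnings;

# With /x, whitespace and "#" comments inside the pattern are ignored,
# so the regex can be laid out like ordinary indented code.
my $quoted_string = qr{
    ^ "                 # opening quote at the start of the input
    (?:
        [^\\"]          # any character except a backslash or a quote
      |                 # ...or...
        \\ .            # a backslash followed by any one character
    )*
    "                   # closing quote
}x;

# Illustrative inputs only.
for my $input ('"plain"', '"with \"escapes\""', 'no quotes') {
    my $verdict = $input =~ $quoted_string ? 'matches' : 'no match';
    printf "%-22s %s\n", $input, $verdict;
}
```

One caveat with this style: a literal space or "#" that should be matched has to be escaped or placed in a character class, because /x would otherwise discard it.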

2) If you don't like the line-noise-resembling symbols themselves, it's not too hard to write your own functions that build regular expressions. E.g. in Perl:

sub at_start { '^'; }
sub at_end { '$'; }
sub any { "."; }
sub zero_or_more { "(?:$_[0])*"; }
sub one_or_more { "(?:$_[0])+"; }
sub optional { "(?:$_[0])?"; }
sub remember { "($_[0])"; }
sub one_of { "(?:" . join("|", @_) . ")"; }
sub in_charset { "[$_[0]]"; }       # I know it's broken for ']'...
sub not_in_charset { "[^$_[0]]"; }   # I know it's broken for ']'...

Then e.g. a regex to match a quoted string (/^"(?:[^\\"]|\\.)*"/) becomes:

at_start .
'"' .
zero_or_more(
    one_of(
        not_in_charset('\\\\"'),    # Yuck, 2 levels of escaping required
        '\\\\' . any
    )
) .
'"'

This "string-building functions" strategy lends itself to expressing useful building blocks as functions. For example, the above regex could be stored in a function called quoted_string(), and you might have other functions for reliably matching any numeric value, an email address, etc.
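
A minimal sketch of that idea, assuming the builder functions from point 2 (only the ones needed are repeated here so the snippet runs on its own). The body of quoted_string(), the extra remember() capture around the string contents, and the use of qr// to compile the result are illustrative assumptions rather than part of the original answer.

```perl
use strict;
use warnings;

# Builder functions repeated from point 2 (only those used below).
sub at_start       { '^' }
sub any            { '.' }
sub zero_or_more   { "(?:$_[0])*" }
sub remember       { "($_[0])" }
sub one_of         { '(?:' . join('|', @_) . ')' }
sub not_in_charset { "[^$_[0]]" }

# Hypothetical building block: returns the pattern as a plain string so it
# can still be combined with other pieces before being compiled.
sub quoted_string {
    return '"'
         . remember(                      # capture the contents (an addition)
               zero_or_more(
                   one_of(
                       not_in_charset('\\\\"'),
                       '\\\\' . any()
                   )
               )
           )
         . '"';
}

# Compile the assembled string once, then reuse the compiled regex.
my $pattern = at_start() . quoted_string();
my $re      = qr/$pattern/;

if ('"say \"hi\"" and more' =~ $re) {
    print "captured: $1\n";
}
```

Returning strings rather than compiled regexes keeps the pieces composable. The trade-off is that the pattern is only checked for validity when it is finally compiled.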
