高尔夫代码:正则表达式解析器 [英] Code Golf: Regex parser
问题描述
今天的Code Golf挑战是创建尽可能少的字符的正则表达式解析器.
Today's Code Golf challenge is to create a regex parser in as few characters as possible.
不,我不是要您匹配Perl样式的正则表达式.毕竟,已经有一个非常可靠的解释器! :-)
No, I'm not asking you to match Perl-style regular expressions. There's already a very reliable interpreter for those, after all! :-)
您需要了解有关此挑战的正则表达式语法:
Here's all you need to know about regex syntax for this challenge:
- 术语被定义为单个文字字符或分组括号
()
中的正则表达式. -
*
(星号)字符表示上一个TERM上的 Kleene星号操作.这意味着零个或多个上一词串联在一起. -
+
(加号)字符表示便捷的快捷方式:a+
等同于aa*
,表示上一个术语中的一个或多个. -
?
(问号)字符代表零或上一项. -
|
(竖线)字符表示交替,这意味着可以在比赛中使用两侧的常规表达式. - 所有其他字符均假定为文字.您可以假定所有其他字符都在
[0-9A-Za-z]
之内(即所有英文字母数字).
- A term is defined as a single literal character, or a regular expression within grouping parentheses
()
. - The
*
(asterisk) character represents a Kleene star operation on the previous TERM. This means zero or more of the previous term, concatenated together. - The
+
(plus) character represents a convenient shortcut:a+
is equivalent toaa*
, meaning one or more of the previous term. - The
?
(question mark) character represents zero or one of the previous term. - The
|
(pipe) character represents an alternation, meaning that the REGULAR EXPRESSIONS on either side can be used in the match. - All other characters are assumed to be literal. You may assume that all other characters are within
[0-9A-Za-z]
(i.e., all English alphanumerics).
或者,换一种说法:*
/+
/?
具有最高优先级,然后是串联,然后是交替.由于交替的优先级低于连接的优先级,因此在不带括号的正则表达式中使用它会使它与每侧的完整正则表达式绑定.另一方面,*
和+
和?
仅适用于紧接的前一个术语.
Or, put another way: *
/+
/?
have highest precedence, then concatenation, then alternation. Since alternation has lower precedence than concatenation, its use within a regex without parentheses causes it to be bound to the full regex on each side. *
and +
and ?
, on the other hand, would just apply to the immediately preceding term.
您的难题是编写一个程序,该程序将编译或解释正则表达式(如上定义),然后针对该正则表达式测试许多字符串.
Your challenge is to write a program that will compile or interpret a regular expression (as defined above) and then test a number of strings against it.
我将把一切留给您.我的建议是,应该首先使用正则表达式,然后再针对它测试任意数量的字符串.但是如果您想使它持久,那很好.如果您想将所有内容放入命令行参数或stdin中,或将正则表达式放入命令行中,并将字符串放入stdin中,等等.仅显示一个或两个用法示例.
I'm leaving input up to you. My recommendation would be that the regex should probably come first, and then any number of strings to be tested against it; but if you want to make it last, that's fine. If you want to put everything in command-line arguments or into stdin, or the regex in command-line and the strings in stdin, or whatever, that's fine. Just show a usage example or two.
输出应为true
或false
,每行一个,以反映正则表达式是否匹配.
Output should be true
or false
, one per line, to reflect whether or not the regex matches.
注意:
- 我不需要这么说...但是不要使用您的语言使用任何正则表达式库!您需要自己编译或解释模式. (如果需要使用正则表达式来拆分或连接字符串,则不能使用它来直接解决问题,例如,将输入的正则表达式转换为语言正则表达式并使用它. )
- 此挑战的正则表达式必须完全匹配输入字符串. (等效地,如果您熟悉类似Perl的正则表达式,则假定所有匹配项都已在字符串的开始和结束锚定)
- 对于此挑战,所有特殊字符
()*+?|
均不应按字面意义出现.如果输入中出现一个,则可以安全地假设没有任何模式可以匹配所讨论的字符串. - 要测试的输入字符串应区分大小写.
- I shouldn't need to say this... but don't use any regex libraries in your language! You need to compile or interpret the pattern yourself. ( You may use regex if it's required for splitting or joining strings. You just can't use it to directly solve the problem, e.g., converting the input regex into a language regex and using that.)
- The regular expression must COMPLETELY match the input string for this challenge. (Equivalently, if you're familiar with Perl-like regex, assume that start- and end-of-string anchoring is in place for all matches)
- For this challenge, all of the special characters
()*+?|
are not expected to occur literally. If one comes up in the input, it is safe to assume that no pattern can match the string in question. - Input strings to test should be evaluated in a case-sensitive manner.
对于示例,我假设一切都在命令行参数中完成,首先是正则表达式. (如上所述,输入由您决定.)myregex
这里表示您对程序的调用.
For the examples, I'm assuming everything is done in command-line arguments, regex first. (As I said above, input is up to you.) myregex
here represents your invocation of the program.
> myregex easy easy Easy hard
true
false
false
> myregex ab*a aa abba abab b
true
true
false
false
> myregex 0*1|10 1 10 0110 00001
true
true
false
true
> myregex 0*(1|1+0) 1 10 0110 00001
true
true
true
true
> myregex a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
true
true
false
true
false
true
注意:抱歉,忘记制作社区Wiki! :-(
NOTE: Sorry, forgot to make community wiki! :-(
推荐答案
GolfScript-254个字符
n%([]:B:$:_"()"@*{:I"()*+|?"[{}/]?[{[[0B$,+:B))\;)]_]+}{B)):ß;:B;qß(:ß;}{8q}{[[0ß0$,)]]+}:8{[[0B-1=:ß)]]+:$q}{ß>$ß<\([0+$,+]\++}:q{[[I$,:ß)]]+}]=~:$}/;{n+[0]:3\{:c;;3:1_:3;{,}{)[$=]_*2/{~\.{c={3|:3}*;}{;.1|1,\:1,<{+0}*;}if}/}/;}/;1$,?)"true""false"if n}%
有些直接,第一个循环将正则表达式转换为NFA,第二个循环运行.
Somewhat straightforwardly, the first loop converts the regex into an NFA, which the second loop runs.
Sun Aug 22 00:58:24 EST 2010
(271→266)更改了变量名称以删除空格
Sun Aug 22 01:07:11 EST 2010
(266→265)将[]
设置为变量
Sun Aug 22 07:05:50 EST 2010
(265→259)内联进行了空状态转换
Sun Aug 22 07:19:21 EST 2010
(259→256)最终状态变为隐式
Mon Feb 7 19:24:19 EST 2011
(256→254),使用"()""str"*
Sun Aug 22 00:58:24 EST 2010
(271→266) changed variable names to remove spaces
Sun Aug 22 01:07:11 EST 2010
(266→265) made []
a variable
Sun Aug 22 07:05:50 EST 2010
(265→259) made null state transitions inline
Sun Aug 22 07:19:21 EST 2010
(259→256) final state made implicit
Mon Feb 7 19:24:19 EST 2011
(256→254) using "()""str"*
$ echo "ab*a aa abba abab b"|tr " " "\n"|golfscript regex.gs
true
true
false
false
$ echo "0*1|10 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
false
true
$ echo "0*(1|1+0) 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
true
true
$ echo "a?b+|(a+b|b+a?)+ abb babab aaa aabba a b"|tr " " "\n"|golfscript regex.gs
true
true
false
true
false
true
$ echo "((A|B|C)+(a|(bbbbb)|bb|c)+)+ ABCABCaccabbbbbaACBbbb ABCABCaccabbbbbaACBbbbb"|tr " " "\n"|golfscript regex.gs
false
true
这篇关于高尔夫代码:正则表达式解析器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!