如何一次有效地将输入字符串与多个正则表达式匹配? [英] How to efficiently match an input string against several regular expressions at once?
问题描述
如何有效地将一个输入字符串与任意数量的正则表达式进行匹配?
How would one efficiently match one input string against any number of regular expressions?
这可能有用的一个场景是使用 REST Web 服务.假设我已经为 REST Web 服务的公共接口提出了许多 URL 模式:
One scenario where this might be useful is with REST web services. Let's assume that I have come up with a number of URL patterns for a REST web service's public interface:
/user/with-id/
{userId}
/user/with-id/
{userId}
/profile
/user/with-id/
{userId}
/preferences
/users
/users/who-signed-up-on/
{date}
/users/who-signed-up-between/
{fromDate}
/and/
{toDate}
- …
其中 {…}
是命名占位符(如正则表达式捕获组).
where {…}
are named placeholders (like regular expression capturing groups).
注意:这个问题不是关于上面的REST接口设计的好不好.(可能不是,但这在这个问题的上下文中应该无关紧要.)
可以假设占位符通常不会出现在模式的开头(但它们可以).也可以安全地假设任何字符串都不可能匹配多个模式.
It may be assumed that placeholders usually do not appear at the very beginning of a pattern (but they could). It can also be safely assumed that it is impossible for any string to match more than one pattern.
现在网络服务收到一个请求.当然,可以依次将请求的 URI 与一个 URL 模式匹配,然后与下一个匹配,依此类推;但对于必须检查的大量模式,这可能无法很好地扩展.
Now the web service receives a request. Of course, one could sequentially match the requested URI against one URL pattern, then against the next one, and so on; but that probably won't scale well for a larger number of patterns that must be checked.
是否有任何有效的算法?
Are there any efficient algorithms for this?
输入:
- 一个输入字符串
- 一组互斥"的正则表达式(即没有输入字符串可以匹配多个表达式)
输出:
- 输入字符串匹配的正则表达式(如果有).
推荐答案
Aho-Corasick 算法 是一种非常快速的算法,用于将输入字符串与一组模式(实际上是关键字)进行匹配,这些模式在特里进行预处理和组织,以加快匹配速度.
The Aho-Corasick algorithm is a very fast algorithm to match an input string against a set of patterns (actually keywords), that are preprocessed and organized in a trie, to speedup matching.
支持正则表达式模式的算法有多种变体(即http://code.google.com/p/esmre/ 只是举个例子)可能值得一看.
There are variations of the algorithm to support regex patterns (ie. http://code.google.com/p/esmre/ just to name one) that are probably worth a look.
或者,您可以将 url 拆分成块,将它们组织成一棵树,然后拆分 url 以匹配并一次一个块地遍历树.{userId} 可被视为通配符,或匹配某些特定格式(即为 int).
Or, you could split the urls in chunks, organize them in a tree, then split the url to match and walk the tree one chunk at a time. The {userId} can be considered a wildcard, or match some specific format (ie. be an int).
当你到达一片叶子时,你知道你匹配的是哪个网址
When you reach a leaf, you know which url you matched
这篇关于如何一次有效地将输入字符串与多个正则表达式匹配?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!