如何一次有效地将输入字符串与多个正则表达式匹配? [英] How to efficiently match an input string against several regular expressions at once?

查看:49
本文介绍了如何一次有效地将输入字符串与多个正则表达式匹配?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何有效地将一个输入字符串与任意数量的正则表达式进行匹配?

How would one efficiently match one input string against any number of regular expressions?

这可能有用的一个场景是使用 REST Web 服务.假设我已经为 REST Web 服务的公共接口提出了许多 URL 模式:

One scenario where this might be useful is with REST web services. Let's assume that I have come up with a number of URL patterns for a REST web service's public interface:

  • /user/with-id/{userId}
  • /user/with-id/{userId}/profile
  • /user/with-id/{userId}/preferences
  • /users
  • /users/who-signed-up-on/{date}
  • /users/who-signed-up-between/{fromDate}/and/{toDate}

其中 {…} 是命名占位符(如正则表达式捕获组).

where {…} are named placeholders (like regular expression capturing groups).

注意:这个问题不是关于上面的REST接口设计的好不好.(可能不是,但这在这个问题的上下文中应该无关紧要.)

可以假设占位符通常不会出现在模式的开头(但它们可以).也可以安全地假设任何字符串都不可能匹配多个模式.

It may be assumed that placeholders usually do not appear at the very beginning of a pattern (but they could). It can also be safely assumed that it is impossible for any string to match more than one pattern.

现在网络服务收到一个请求.当然,可以依次将请求的 URI 与一个 URL 模式匹配,然后与下一个匹配,依此类推;但对于必须检查的大量模式,这可能无法很好地扩展.

Now the web service receives a request. Of course, one could sequentially match the requested URI against one URL pattern, then against the next one, and so on; but that probably won't scale well for a larger number of patterns that must be checked.

是否有任何有效的算法?

Are there any efficient algorithms for this?

输入:

  • 一个输入字符串
  • 一组互斥"的正则表达式(即没有输入字符串可以匹配多个表达式)

输出:

  • 输入字符串匹配的正则表达式(如果有).

推荐答案

Aho-Corasick 算法 是一种非常快速的算法,用于将输入字符串与一组模式(实际上是关键字)进行匹配,这些模式在特里进行预处理和组织,以加快匹配速度.

The Aho-Corasick algorithm is a very fast algorithm to match an input string against a set of patterns (actually keywords), that are preprocessed and organized in a trie, to speedup matching.

支持正则表达式模式的算法有多种变体(即http://code.google.com/p/esmre/ 只是举个例子)可能值得一看.

There are variations of the algorithm to support regex patterns (ie. http://code.google.com/p/esmre/ just to name one) that are probably worth a look.

或者,您可以将 url 拆分成块,将它们组织成一棵树,然后拆分 url 以匹配并一次一个块地遍历树.{userId} 可被视为通配符,或匹配某些特定格式(即为 int).

Or, you could split the urls in chunks, organize them in a tree, then split the url to match and walk the tree one chunk at a time. The {userId} can be considered a wildcard, or match some specific format (ie. be an int).

当你到达一片叶子时,你知道你匹配的是哪个网址

When you reach a leaf, you know which url you matched

这篇关于如何一次有效地将输入字符串与多个正则表达式匹配?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆