有效地查询一个字符串中对多个正则表达式 [英] Efficiently querying one string against multiple regexes

查看:203
本文介绍了有效地查询一个字符串中对多个正则表达式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

比方说,我有10,000个正则表达式和一个字符串,我想看看字符串匹配任何人,并得到所有的比赛。 在平凡的方式做到这一点是只查询字符串逐一对所有的正则表达式。是否有一个更快,更有效的方法来做到这一点?

Lets say that I have 10,000 regexes and one string and I want to find out if the string matches any of them and get all the matches. The trivial way to do it would be to just query the string one by one against all regexes. Is there a faster,more efficient way to do it?

编辑: 我曾尝试与DFA的(法),将其代 这里的问题是,它只会给你一个单一的模式。如果我有一个字符串hello和模式[H | H] ELLO和,DFA将只匹配其中之一,但我希望他们两个人打

I have tried substituting it with DFA's (lex) The problem here is that it would only give you one single pattern. If I have a string "hello" and patterns "[H|h]ello" and ".{0,20}ello", DFA will only match one of them, but I want both of them to hit.

推荐答案

马丁Sulzmann 做了在这个领域相当多的工作。 他有一个HackageDB项目解释breifly的here 其使用的偏导数似乎是裁缝本作。

Martin Sulzmann Has done quite a bit of work in this field. He has a HackageDB project explained breifly here which use partial derivatives seems to be tailor made for this.

使用的语言是哈斯克尔,因此将很难转化为一个非功能性的语言,如果是这样的欲望(我想转换到许多其他FP语言仍是十分困难)。

The language used is Haskell and thus will be very hard to translate to a non functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).

的code未基于转换成一连串的自动机,然后将它们组合起来,而不是它是基于正则表达式的符号操作本身

The code is not based on converting to a series of automata and then combining them, instead it is based on symbolic manipulation of the regexes themselves.

另外,code是非常多的实验和马丁已经不再是教授,但在有酬就业(1),所以可能不感兴趣/无法提供任何帮助或输入。

Also the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1) so may be uninterested/unable to supply any help or input.


  1. 这是一个笑话! - 我喜欢教授,少聪明的人尝试去解决更多的机会,我得到报酬的

这篇关于有效地查询一个字符串中对多个正则表达式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆