Regex efficiency/performance: comparing these different types of constructions: \btheor(y\b|ies\b) vs. \btheory\b|\btheories\b

Question

I am writing an expression in which the building blocks would follow one of these two types (or other ones if you have suggestions):

Expression 1: \btheor(y\b|ies\b)

Expression 2: \btheory\b|\btheories\b (...all variations of the words written out)

Since I am a newbie who has just started with regex, I would like to know which structure is more efficient overall, i.e. which gives faster search results, considering that the real expression will look like this (using the expression 1 structure) and will run over a large collection (5,000 PDFs):

theor(y\b|ies\b).{0,30}\bWORD\b|perspective(s?\b).{0,30}\bWORD\b|approa(ch\b|ches\b).{0,30}\bWORD\b|paradigm(s?\b|as?\b).{0,30}\bWORD\b|method(s?\b).{0,30}\bWORD\b|m.todo(s?\b).{0,30}\bWORD\b|methodolog(y\b|ies\b).{0,30}\bWORD\b

Recommended answer

I want to drive home @JonathonReinhart's advice, so I will start by copying:

Protip #1: Keep it simple. Protip #2: Test your suspicions: Code up both methods and time them, using whatever facilities your language or system provides. If the more complex option is not drastically faster, forget about using it.
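As a rough illustration of Protip #2, here is a minimal timing sketch in Python. It assumes Python's built-in `re` engine and a made-up sample text; `SAMPLE`, `count_matches`, and `time_pattern` are hypothetical names chosen for this example, and the numbers you get will depend on your engine and on the text extracted from your own PDFs.

```python
import re
import timeit

# The two candidate building blocks from the question.
FACTORED = re.compile(r"\btheor(y\b|ies\b)")
SPELLED_OUT = re.compile(r"\btheory\b|\btheories\b")

# Hypothetical sample text; substitute text extracted from your own PDFs.
SAMPLE = ("A theorem is not a theory, and theoretical work differs from "
          "the theories it produces. ") * 1000


def count_matches(pattern, text):
    """Return how many non-overlapping matches the pattern finds in text."""
    return sum(1 for _ in pattern.finditer(text))


def time_pattern(pattern, text, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    return min(timeit.repeat(lambda: count_matches(pattern, text),
                             number=1, repeat=repeats))


if __name__ == "__main__":
    for name, pattern in (("factored", FACTORED), ("spelled out", SPELLED_OUT)):
        print(f"{name}: {time_pattern(pattern, SAMPLE):.4f} s, "
              f"{count_matches(pattern, SAMPLE)} matches")
```

Taking the minimum over several runs reduces noise from other processes. If the two timings come out close, that is a reason to pick whichever form is easier to read.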

However, to directly answer your question, expression one will be more efficient. Regular expressions read from left to right and try to eat up (match) as many characters as possible. So let's say you have the string theories.

With theor(y|ies), regex will first successfully match t, h, e, o, and r. Then, it will try to match y and fail. Then it will successfully match i, e, and s.

With (theory)|(theories), regex will first successfully match t, h, e, o, and r. After failing on y, it has to go back and re-match t, h, e, o, and r before matching the final i, e, and s.
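Whichever form turns out faster, the two should find exactly the same matches. The sketch below checks that with Python's `re` module (an assumption; the answer does not name an engine). `WORDS`, `spans`, and `same_matches` are hypothetical names used only for this illustration.

```python
import re

# The two forms traced above, without the word-boundary anchors.
FACTORED = re.compile(r"theor(y|ies)")
SPELLED_OUT = re.compile(r"(theory)|(theories)")

# Hypothetical test inputs, including near misses that force a failed attempt.
WORDS = ["theory", "theories", "theorem", "theor", "theorie",
         "a theory of theories"]


def spans(pattern, text):
    """Return the (start, end) position of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]


def same_matches(words):
    """Return True if both patterns match the same spans in every input."""
    return all(spans(FACTORED, w) == spans(SPELLED_OUT, w) for w in words)


if __name__ == "__main__":
    for word in WORDS:
        print(f"{word!r}: {spans(FACTORED, word)}")
    print("identical results:", same_matches(WORDS))
```

Only the capture groups differ between the two forms; the matched text is the same, so the choice between them comes down to speed and readability.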

Hopefully it's easy enough to see that the first one has fewer possibilities to go through and try to match, making it (ever so slightly) faster. Also, I think the first expression looks much cleaner, and the fact that it is shorter will make a big difference in readability if you combine a lot of these together.

If you can't visualize that from my descriptions, step through both expressions in a regex debugger.

You can clearly see that as the input gets closer to matching (but still fails), the first expression greatly outperforms the second, because it doesn't have to go back and start over. Backtracking is worth reading up on, since it can cause drastic performance issues.
