Why does this regex take so long to find email addresses in certain files?

Problem description

I have a regular expression that looks for email addresses (it was taken from another SO post that I can't find, and it has been tested on all kinds of email configurations; changing it is not exactly my question, but I understand if it turns out to be the root cause):

/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

I'm using preg_match_all() in PHP.
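
For reference, a minimal sketch of such a call might look like the following. The sample text and variable names are made up for illustration, and the error check is an addition that does not appear in the original post.

```php
<?php
// Pattern copied from the question; the sample text is made up.
$pattern = '/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i';
$html = 'Contact alice@example.com or bob+news@example.co.uk today.';

$count = preg_match_all($pattern, $html, $matches);

if ($count === false) {
    // preg_match_all() returns false on failure, for example when PCRE
    // hits pcre.backtrack_limit; preg_last_error() reports the reason.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    // $matches[0] holds the full matches, $matches[1] the first capture group.
    print_r($matches[0]);
}
```

Here $matches[0] would contain the two addresses and $matches[1] the captured "com" and "co".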

This works great for 99.99...% of the files I'm looking in and takes around 5ms, but occasionally it takes a couple of minutes. These files are larger than the average webpage at around 300k, but much larger files generally process fine. The only thing I can find in the file contents that stands out is strings of thousands of consecutive "random" alphanumeric characters like this:

wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H...

Here are two pages causing the problem. View source to see the long strings.

Any thoughts on what is causing this?

--Final Solution--

I tested various regexes suggested in the answers. @FailedDev's answer helped and dropped processing time from a few minutes to a few seconds. @hakre's answer solved the problem and reduced processing time to a few hundred milliseconds. Below is the final regex I used. It's @hakre's second suggestion.

/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

Recommended answer

You already know that your regex is causing an issue for large files. So maybe you can make it a bit smarter?

For example, you're using + to match one or more chars. Let's say you have a run of 10 000 chars from that character class with no @ after it. From each starting position the regex grabs the whole rest of the run, fails to find the @, and then gives chars back one at a time before it gives up, so a single starting position can cost up to 10 000 attempts. Then you combine that with the similar group after the @. Let's say you have a string with 20 000 chars and two + groups. How many ways could they match in the file? Probably 10 000 x 10 000 possibilities. And so on and so forth.

If you can limit the number of characters (this looks a bit like you're looking for email patterns), you could limit the domain name to 256 characters and the address itself (the part before the @) to 256 characters. Then there would "only" be 256 x 256 possibilities to test:

/[a-z0-9_\-\+]{1,256}@[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i
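
To put rough numbers on that reasoning, here is a small sketch of a counting model. It is an assumption made for illustration, not PCRE's real accounting: the hypothetical helper countAtChecks() only counts how often a naive backtracking matcher would test for "@" after the first group, on a run of characters that contains no "@" at all. Real PCRE applies optimizations that this model ignores.

```php
<?php
// Rough model (an assumption, not how PCRE really counts): number of "@"
// tests a naive backtracking matcher performs for `[class]+@` (or
// `[class]{1,$max}@`) on a run of $n class characters with no "@" in it.
function countAtChecks(int $n, ?int $max = null): int
{
    $checks = 0;
    for ($start = 0; $start < $n; $start++) {
        $taken = $n - $start;            // greedy: grab the rest of the run
        if ($max !== null) {
            $taken = min($taken, $max);  // {1,max} caps how much is grabbed
        }
        // One "@" test for each length from $taken down to 1.
        $checks += $taken;
    }
    return $checks;
}

printf("unbounded + : %d\n", countAtChecks(10000));       // 50005000
printf("{1,256}     : %d\n", countAtChecks(10000, 256));  // 2527360
```

In this model the capped quantifier needs roughly twenty times fewer tests on a 10 000 character run, and the gap widens as the run gets longer because the capped count grows linearly rather than quadratically.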

That's probably already much faster. Then making those quantifiers possessive will reduce backtracking for PCRE:

/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

Which should speed it up again.
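
As a small illustration of what "possessive" means, the sketch below first uses a toy pattern (not from the original post) where giving characters back matters, and then checks the two capped email patterns against a made-up sample text.

```php
<?php
// Greedy: a+ first takes "aaa", then gives one "a" back so "ab" can match.
var_dump(preg_match('/a+ab/', 'aaab'));   // int(1)

// Possessive: a++ takes all the "a"s it can and never gives any back,
// so the following "ab" cannot match at any starting position.
var_dump(preg_match('/a++ab/', 'aaab'));  // int(0)

// In the email pattern the character that follows each possessive group
// ("@" or ".") is not in the preceding class, so giving characters back
// could never help, and both versions find the same addresses.
$greedy     = '/[a-z0-9_\-\+]{1,256}@[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i';
$possessive = '/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i';
$text = 'Mail alice@example.com or bob+news@example.co.uk.';

preg_match_all($greedy, $text, $a);
preg_match_all($possessive, $text, $b);
var_dump($a[0] === $b[0]);                // bool(true)
```

The possessive form only removes backtracking work that was bound to fail, which is why it can be faster without changing the result for this pattern.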
