RegExp和String组合崩溃Chrome [英] RegExp and String combination crashes Chrome

查看:66
本文介绍了RegExp和String组合崩溃Chrome的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下RegExp来验证电子邮件地址:

  ^ [A-Za-z0-9](( [_\.\  - ] + [A-ZA-Z0-9] +)*)@([A-ZA-Z0-9] +)(([\.\  - ] [A- ([A-Za-z] {2,})$ 

在基本电子邮件上运行它的效果非常好:

  / ^ [A-Za-z0-9] (([_\.\  - ] + [A-ZA-Z0-9] +)*)@([A-ZA-Z0-9] +)(([\.\  - ] [ A-ZA-Z0-9] +)*)\([A-ZA-Z] {2,})$ /测试('dave@the-taylors.org')。; 

但是在长字符串上运行它会导致Chrome崩溃:

  / ^ [A-Za-z0-9](([_ \.\  - ]?[a-zA-Z0-9] +)*)@ ([A-ZA-Z0-9] +)(([\.\  - ] + [A-ZA-Z0-9] +)*)。\([A-ZA-Z] {2, })$ /测试( 'dddddddddddddddddddddddddddddddddddddddd'); 

我注意到它在40个字符左右开始



这个RegExp的密集程度如何?

解决方案

好的,让我们看看这里发生了什么。幸运的是,你已经简化了问题,直到你应用正则表达式时会发生什么。

 (d +)* @ 

添加到字符串中

  ddddd 

显然无法匹配,因为 @ 缺失。但是,什么使得正则表达式引擎快速找出这个问题呢?



好的,(d +)* 本质上可以是由以下匹配完成(每组用空格分隔):

  ddddd 
dddd d
ddd dd
dd ddd
d dddd
ddd dd
dd dd dd
dd ddd
dd dd dd $ b $ dd dd dd
d ddd d
ddd dd
dd dd d
d dd dd
dd ddd
ddddd

所以,你有一种匹配字符串而不破坏字符串的方法,有四种方法将它分解成两个字符串,六种方法将它拆分为三个,四个将其拆分为四个,另一个拆分它分成五个字符串。所有这些组合必须由正则表达式引擎进行检查,然后才能声明匹配失败,最终必须面对以下 @



为什么它不能更快​​地显示出来?那么,一些正则表达式引擎可以用这样一个简单的例子来实现它。我敢打赌拉里沃尔已经覆盖了。但是你的实际正则表达式有点复杂,所以我想事先很难弄明白。另外,还有一种简单的方法可以确保所有这些组合尝试不会发生。但我会在稍后再讨论这个问题。



我一直在想,有多少这样的组合会比较长的字符串比那些小五 d s:



让我们取一串长度 m (可以剪切除了 m-1 不同的位置)。比方说 n = m-1 。然后你可以计算组合的数量如下:

  1 +(n!/(1!*(n-1) !))+(n!/(2!*(n-2)!))+ ... +(n!/(n!*(nn)!))

第一个和最后一个项目始终为1,但中间的项目可能会变得很大。我们来编写一个小的Python程序:

 >>>从数学导入阶乘作为f 
>>> def comb(n,x):
... return f(n)//(f(x)* f(n-x))
...
>>> def ways_to_split(len):
... return 1 + sum(comb(len-1,x)for x in range(1,len))
...
>> gt ;> ways_to_split(5)
16

好的,似乎有效。让我们试试更大的东西:

 >>> ways_to_split(10)
512
>>> ways_to_split(40)
549755813888
>>> ways_to_split(100)
633825300114114700748351602688

嘿,这里有一个模式: ways_to_split(n)等于 2 **(n-1)。请参阅数学SE 获取证明。无论如何。你的正则表达式有 O(2 ^ n)复杂度。好吧,现在你明白了为什么这可能需要一段时间......幸运的是,许多正则表达式引擎提供了保护:占有量词或原子捕获组。

 (d ++)* 

 (?> d +)* 

既能确保无论是 d + 匹配的都不会再次尝试其他组合。好消息,对吧?

好吧,如果您使用JavaScript,则不是。它不支持任何这些功能。



因此,您需要重写您的正则表达式不会像这样回溯:



而不是

 (([_ \.\  - ]?[a-zA -Z0-9] +)*)

使用

  [a-zA-Z0-9] +([_.-] [a-zA-Z0-9] +)* 



I have the following RegExp to validate an email address:

^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$

Running it on a basic email works beautifully:

/^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$/.test('dave@the-taylors.org');

But running it on a long string crashes Chrome:

/^[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})$/.test('dddddddddddddddddddddddddddddddddddddddd');

I've noticed it kicks in at around 40 characters

What is it about this RegExp which is so intensive?

解决方案

OK, let's see what's happening here. You have thankfully already simplified the problem down to what happens when you apply the regex

(d+)*@

to the string

ddddd

which clearly can't be matched because the @ is missing. But what's keeping the regex engine from figuring this out quickly?

Well, (d+)* in essence can be fulfilled by the following matches (each group separated by spaces):

ddddd
dddd d
ddd dd
dd ddd
d dddd
ddd d d
dd d dd
d d ddd
dd dd d
d dd dd
d ddd d
d d d dd
d d dd d
d dd d d
dd d d d
d d d d d

So you have one way of matching the string without breaking up the string, four ways to break it up into two strings, six ways to break it up into three, four to break it into four, and one to break it up into five strings. All these combinations have to be checked by the regex engine before it can declare failure to match when it finally has to face the following @.

Why doesn't it figure that out sooner? Well, some regex engines can probably do it with such a simplified example. I bet Larry Wall has that covered. But your actual regex is a bit more complex, so I guess that would be much harder to figure out beforehand. Plus, there is a simple way to ensure all this combination-trying doesn't happen. But I'll come back to that later.

I had been wondering how many such combinations there would be for a longer string than those puny five ds:

Let's take a string of length m (which can be cut apart in m-1 different positions). Let's say n = m-1. Then you can calculate the number of combinations as follows:

1 + (n!/(1! * (n-1)!)) + (n!/(2! * (n-2)!)) + ... + (n!/(n! * (n-n)!))

The first and last items are always 1, but the items in-between can get quite large. Let's write a small Python program:

>>> from math import factorial as f
>>> def comb(n,x):
...    return f(n) // (f(x) * f(n-x))
...
>>> def ways_to_split(len):
...    return 1 + sum(comb(len-1,x) for x in range(1,len))
...
>>> ways_to_split(5)
16

OK, seems to work. Let's try something bigger:

>>> ways_to_split(10)
512
>>> ways_to_split(40)
549755813888
>>> ways_to_split(100)
633825300114114700748351602688

Hey, there's a pattern here: ways_to_split(n) is equal to 2**(n-1). See Mathematics SE for a proof. Anyway. Your regex has O(2^n) complexity. Well, now you see why that might take a while...

Thankfully, many regex engines provide protection against this: possessive quantifiers or atomic capturing groups.

(d++)*

or

(?>d+)*

both ensure that whatever d+ has matched will not be relinquished again for trying other combinations. Good news, right?

Well, not if you use JavaScript. It doesn't support either of those features.

Therefore, you either need to rewrite your regex not to be susceptible to backtracking like this:

Instead of

(([_\.\-]?[a-zA-Z0-9]+)*)

use

[a-zA-Z0-9]+([_.-][a-zA-Z0-9]+)*

Or stop trying to use a regex to validate an e-mail address which doesn't work reliably, ever, anyway.

这篇关于RegExp和String组合崩溃Chrome的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆