Best Regular Expression for Email Format Validation with ASP.NET 3.5 Validation
Question
I've used both of the following regular expressions to test for a valid email address with ASP.NET validation controls. I was wondering which is the better expression from a performance standpoint, or whether someone has a better one.
- \w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
- ^([0-9a-zA-Z]([-\.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$
I am trying to avoid the regex performance problem described on the BCL Team Blog.
Update
Based on feedback I ended up creating a function to test if an email is valid:
Public Function IsValidEmail(ByVal emailString As String, Optional ByVal isRequired As Boolean = False) As Boolean
    Dim emailSplit As String()
    Dim isValid As Boolean = True
    Dim localPart As String = String.Empty
    Dim domainPart As String = String.Empty
    Dim domainSplit As String()
    Dim tld As String

    If emailString.Length >= 80 Then
        isValid = False
    ElseIf emailString.Length > 0 AndAlso emailString.Length < 6 Then
        'Email is too short
        isValid = False
    ElseIf emailString.Length > 0 Then
        'Email is optional, only test value if provided
        emailSplit = emailString.Split(CChar("@"))

        If emailSplit.Length <> 2 Then
            'Only 1 @ should exist
            isValid = False
        Else
            localPart = emailSplit(0)
            domainPart = emailSplit(1)
        End If

        If isValid = False OrElse domainPart.Contains(".") = False Then
            'Needs at least 1 period after @
            isValid = False
        Else
            'Test Local-Part Length and Characters
            If localPart.Length > 64 OrElse ValidateString(localPart, ValidateTests.EmailLocalPartSafeChars) = False OrElse _
               localPart.StartsWith(".") OrElse localPart.EndsWith(".") OrElse localPart.Contains("..") Then
                isValid = False
            End If

            'Validate Domain Name Portion of email address
            If isValid = False OrElse _
               ValidateString(domainPart, ValidateTests.HostNameChars) = False OrElse _
               domainPart.StartsWith("-") OrElse domainPart.StartsWith(".") OrElse domainPart.Contains("..") Then
                isValid = False
            Else
                domainSplit = domainPart.Split(CChar("."))
                tld = domainSplit(UBound(domainSplit))

                ' Top Level Domains must be at least two characters
                If tld.Length < 2 Then
                    isValid = False
                End If
            End If
        End If
    Else
        'If no value is passed review if required
        If isRequired = True Then
            isValid = False
        Else
            isValid = True
        End If
    End If

    Return isValid
End Function
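Note:
- IsValidEmail is more restrictive about the allowed characters than the RFC, but it does not test for every possible invalid use of those characters.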
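IsValidEmail calls ValidateString and ValidateTests, neither of which is shown in the question. The sketch below is only a guess at what such helpers might look like, so that the function can be compiled and tried out. The enum members are the names used above, but the character sets behind them are assumptions (deliberately narrower than the RFC, in line with the note) and not the author's actual code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ValidationHelpers
    ' Hypothetical stand-in for the enum referenced by IsValidEmail.
    Public Enum ValidateTests
        EmailLocalPartSafeChars
        HostNameChars
    End Enum

    ' Hypothetical stand-in: True when every character of value belongs to the
    ' set chosen by test. Both character sets are assumptions for illustration.
    Public Function ValidateString(ByVal value As String, ByVal test As ValidateTests) As Boolean
        Select Case test
            Case ValidateTests.EmailLocalPartSafeChars
                ' Letters, digits and a small set of punctuation.
                Return Regex.IsMatch(value, "\A[0-9A-Za-z._%+-]+\z")
            Case ValidateTests.HostNameChars
                ' Letters, digits, hyphen and dot.
                Return Regex.IsMatch(value, "\A[0-9A-Za-z.-]+\z")
            Case Else
                Return False
        End Select
    End Function

    Sub Main()
        Console.WriteLine(ValidateString("john.smith+news", ValidateTests.EmailLocalPartSafeChars)) ' True
        Console.WriteLine(ValidateString("john smith", ValidateTests.EmailLocalPartSafeChars))      ' False
        Console.WriteLine(ValidateString("mail.example.com", ValidateTests.HostNameChars))          ' True
    End Sub
End Module
```

These helpers only check which characters appear. The positional rules (no leading or trailing dot, no "..") stay in IsValidEmail itself.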
Recommended answer
If you're wondering why this question is generating so little activity, it's because there are so many other issues that should be dealt with before you start thinking about performance. Foremost among those is whether you should be using regexes to validate email addresses at all--and the consensus is that you should not. It's much trickier than most people expect, and probably pointless anyway.
Another problem is that your two regexes vary hugely in the kinds of strings they can match. For example, the second one is anchored at both ends, but the first isn't; it would match ">>>>foo@bar.com<<<<" because there's something that looks like an email address embedded in it. Maybe the framework forces the regex to match the whole string, but if that's the case, why is the second one anchored?
Another difference is that the first regex uses \w throughout, while the second uses [0-9a-zA-Z] in many places. In most regex flavors, \w matches the underscore in addition to letters and digits, but in some (including .NET) it also matches letters and digits from every writing system known to Unicode.
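Both differences are easy to check with System.Text.RegularExpressions directly. The following is a minimal sketch; the module name and the MatchesWholeInput helper are made up for illustration. As far as I know, RegularExpressionValidator only accepts a value when the match covers the entire input, which would hide the missing anchors inside a validator, but a direct Regex.IsMatch call has no such safeguard.

```vbnet
Imports System
Imports Microsoft.VisualBasic
Imports System.Text.RegularExpressions

Module RegexDifferencesDemo
    ' First regex from the question; it carries no ^ or $ anchors.
    Public Const FirstRegex As String = "\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"

    ' Hypothetical helper: True only when the pattern matches the input from start to end.
    Public Function MatchesWholeInput(ByVal input As String, ByVal pattern As String) As Boolean
        Return Regex.IsMatch(input, "\A(?:" & pattern & ")\z")
    End Function

    Sub Main()
        Dim wrapped As String = ">>>>foo@bar.com<<<<"
        Console.WriteLine(Regex.IsMatch(wrapped, FirstRegex))      ' True: finds the embedded address
        Console.WriteLine(MatchesWholeInput(wrapped, FirstRegex))  ' False: the whole string must match

        Dim accented As String = "jos" & ChrW(&HE9)  ' "jos" followed by e-acute (U+00E9)
        Console.WriteLine(Regex.IsMatch(accented, "^\w+$"))                           ' True: \w is Unicode-aware
        Console.WriteLine(Regex.IsMatch(accented, "^\w+$", RegexOptions.ECMAScript))  ' False: \w means [a-zA-Z_0-9]
        Console.WriteLine(Regex.IsMatch(accented, "^[0-9a-zA-Z]+$"))                  ' False
    End Sub
End Module
```

RegexOptions.ECMAScript is one way to get ASCII-only \w in .NET; spelling out [0-9a-zA-Z_] has the same effect and makes the intent visible in the pattern itself.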
There are many other differences, but that's academic; neither of those regexes is very good. See here for a good discussion of the topic, and a much better regex.
Getting back to the original question, I don't see a performance problem with either of those regexes. Aside from the nested-quantifiers anti-pattern cited in that BCL blog entry, you should also watch out for situations where two or more adjacent parts of the regex can match the same set of characters--for example,
([A-Za-z]+|\w+)@
There's nothing like that in either of the regexes you posted. Parts that are controlled by quantifiers are always broken up by other parts that aren't quantified. Both regexes will experience some avoidable backtracking, but there are many better reasons than performance to reject them.
EDIT: So the second regex is subject to catastrophic backtracking; I should have tested it thoroughly before shooting my mouth off. Taking a closer look at that regex, I don't see why you need the outer asterisk in the first part:
[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*
All that bit does is make sure the first and last characters are alphanumeric while allowing some additional characters in between. This version does nearly the same thing (the one difference is that it needs at least two characters, where the original also accepts a single alphanumeric character), but it fails much more quickly when no match is possible:
[0-9a-zA-Z][-.\w]*[0-9a-zA-Z]
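A rough way to see the difference is to time both fragments against input that cannot match. This is a sketch under a few assumptions: the fragments are anchored with ^ and followed by "@" as in the full regex, TimeMatch is a made-up helper, and 22 is an arbitrary input length. The nested version's time should roughly double with each extra character, so raise the length carefully.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Text.RegularExpressions

Module BacktrackingDemo
    ' Local-part fragments from the answer, anchored and followed by "@".
    Public Const Nested As String = "^[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@"
    Public Const Flat As String = "^[0-9a-zA-Z][-.\w]*[0-9a-zA-Z]@"

    ' Hypothetical helper: milliseconds spent on a single IsMatch call.
    Public Function TimeMatch(ByVal pattern As String, ByVal input As String) As Long
        Dim watch As Stopwatch = Stopwatch.StartNew()
        Regex.IsMatch(input, pattern)
        watch.Stop()
        Return watch.ElapsedMilliseconds
    End Function

    Sub Main()
        ' No "@" anywhere, so both patterns must fail; only the amount of work differs.
        Dim noAtSign As String = New String("a"c, 22)
        Console.WriteLine(String.Format("nested: {0} ms", TimeMatch(Nested, noAtSign)))
        Console.WriteLine(String.Format("flat:   {0} ms", TimeMatch(Flat, noAtSign)))

        ' The fragments disagree only on a one-character local part.
        Console.WriteLine(Regex.IsMatch("a@", Nested))  ' True
        Console.WriteLine(Regex.IsMatch("a@", Flat))    ' False
    End Sub
End Module
```

Exact timings depend on the machine and the .NET version, so none are quoted here.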
That would probably suffice to eliminate the backtracking problem, but you could also make the part after the "@" more efficient by using an atomic group:
(?>(?:[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+)[a-zA-Z]{2,9}
In other words, if you've matched all you can of substrings that look like domain components with trailing dots, and the next part doesn't look like a TLD, don't bother backtracking. The first character you would have to give up is the final dot, and you know [a-zA-Z]{2,9} won't match that.
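Here is a small sketch of the atomic-group version applied to the domain part only. The ^ and $ anchors and the IsDomainLike helper are additions for the sake of the example, not part of the fragment above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AtomicGroupDemo
    ' Domain fragment from the answer, wrapped in ^...$ for standalone use.
    Public Const DomainPattern As String = _
        "^(?>(?:[0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+)[a-zA-Z]{2,9}$"

    ' Hypothetical helper: True when the text looks like "label.label.tld".
    Public Function IsDomainLike(ByVal domain As String) As Boolean
        Return Regex.IsMatch(domain, DomainPattern)
    End Function

    Sub Main()
        Console.WriteLine(IsDomainLike("mail.example.com"))  ' True
        ' The group consumes "mail." and "example."; "c0m" is not a valid TLD,
        ' and the atomic group refuses to give back any of the labels.
        Console.WriteLine(IsDomainLike("mail.example.c0m"))  ' False
        Console.WriteLine(IsDomainLike("example"))           ' False: no dot-terminated label
    End Sub
End Module
```

With the trailing $ in place, the atomic group should not change which strings are accepted; it only stops the engine from retrying shorter label sequences that cannot succeed.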