Why is jQuery's email validation regex so simple?


Question


We all know that a regex to validate emails properly would be quite complicated. However, jQuery's validation plugin has a shorter regex (contributed by Scott Gonzalez), spanning only a few lines:

/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|
((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|
[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]
|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|
[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/

Why is this so 'simple' compared to the more well-known monstrosity? Are there cases where one regex would fail and the other would succeed (whether the cases are valid or invalid emails)?

Solution

The regex is a custom combination of:

  • RFC 2234 ABNF
  • RFC 2396 URI Generic Syntax (obsoleted by RFC 3986)
  • RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1
  • RFC 2822 Internet Message Format
  • RFC 3987 IRI
  • RFC 3986 URI Generic Syntax

I wrote the regex when Web Forms 2.0 was being drafted and RFC 5322 did not exist. If you look at the order in which the RFCs were written, you'll notice that the definition for IRI and URI changed after Internet Message Format was written. This means that RFC 2822 does not support current IRI definitions. Unfortunately, it wasn't a simple task of just substituting definitions, so I had to pick and choose which definitions to use from which RFCs. I also made choices about what to remove (like support for comments).

The regex is not fully hand-written. While I did manually write every section of the regex, I scripted the "glue". Each definition from the RFCs is stored in a variable, with compound definitions utilizing the variables that store the simpler definitions (@Walf: this is why there are so many subpatterns and ors).
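The "glue" idea can be illustrated with a minimal sketch. This is not the actual generator script: the variable names (`alpha`, `ucschar`, `atext`, `dotAtom`) and the `or` helper are assumptions made for illustration, and only the unquoted local part is built here. It shows how simple definitions stored in variables compose into the nested subpatterns seen in the final regex.

```javascript
// Simple definitions, each stored as a regex source string.
// The names are hypothetical; they loosely follow the RFC grammar terms.
const alpha = "[a-z]";
const digit = "\\d";
const ucschar = "[\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]";
const atextSymbols = "[!#\\$%&'\\*\\+\\-\\/=\\?\\^_`{\\|}~]";

// Glue helper: joins alternatives into one grouped subpattern.
const or = (...parts) => "(" + parts.join("|") + ")";

// Compound definitions reuse the simpler ones, which is why the
// generated pattern ends up with so many groups and alternations.
const atext = or(alpha, digit, atextSymbols, ucschar);
const dotAtom = "(" + atext + "+(\\." + atext + "+)*)";

// Anchor the fragment so it can be tried on its own.
const dotAtomRegex = new RegExp("^" + dotAtom + "$", "i");

console.log(dotAtom);
console.log(dotAtomRegex.test("first.last"));  // true
console.log(dotAtomRegex.test("first..last")); // false
```

The printed `dotAtom` source has the same shape as the opening of the plugin's regex, which suggests why a generated pattern looks more verbose than a hand-tuned one.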

To complicate the matter, the version of the regex that is used in the jQuery Validation plugin is modified even further to account for differences between spec-valid addresses and user expectation of a valid address. I have no recollection of what modifications I made. I promised Jörn Zaefferer (the author of the validation plugin) that I would write a newer script to generate the regex. The new script would allow you to specify options for what you do and don't want to support (required TLD, specific TLDs, IPv6, comments, obsolete definitions, quoted local names, etc.). That was 5 years ago. I started it once, but never finished. Maybe one day I will. What I have so far is hosted on GitHub: https://github.com/scottgonzalez/regex-builder

If you want a regex for validating email addresses, I'd suggest the following regex which is included in the HTML5 specification:

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

If you use regex-builder and turn off all the options, you'll get something similar. But it's been about a year since I looked at that, so I don't remember what the differences are.
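As a rough sketch of how the HTML5 regex above behaves in practice, the snippet below runs it against a few sample addresses. The `isValidEmail` helper and the sample list are assumptions for illustration; the only change to the pattern is that the `/` inside the character class is escaped, which does not alter what it matches.

```javascript
// HTML5 regex from the answer, with "/" escaped inside the character class.
const html5Email = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Hypothetical helper name, used only in this sketch.
function isValidEmail(address) {
  return html5Email.test(address);
}

const samples = [
  "user@example.com",               // true
  "user@localhost",                 // true: no TLD is required
  "first.last+tag@sub.example.org", // true
  "a..b@example.com",               // true: dots are unrestricted in the local part
  "user@-example.com",              // false: a label cannot start with "-"
  "user@example..com",              // false: empty label
  "\"quoted name\"@example.com",    // false: quoted local parts are not supported
  "no-at-sign",                     // false
];

for (const address of samples) {
  console.log(address, isValidEmail(address));
}
```

The `a..b@example.com` and quoted-name cases show that this pattern does not track RFC 5322 exactly: it accepts some addresses the RFC rejects and rejects some it allows.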


I'd also like to point out that the link in the original question specifically mentions RFC 822. While it's great that RFC 822 advanced us from Arpanet to the ARPA Internet, this isn't exactly current. The Internet has made a few advances in the past three decades and this RFC has been superseded twice. I'd like to see any new work following the latest standards.


UPDATE:

A friend asked me why the HTML5 regex doesn't support UTF-8. I've never asked Hixie about it, but I assume this is the reason: Even though some TLDs started to support IDNs (Internationalized Domain Names) in 2000 and RFC 3987 (IRI) was written in 2005, when RFC 5322 was written in 2008, it only listed characters in the ranges 33-90 and 94-126 as valid dtext (characters allowed for use in a domain literal). HTML5 is based on RFC 5322 and as a result there is no UTF-8 support. It certainly seems strange that RFC 5322 doesn't account for IDNs, but it's worth noting that even in 2008 IDNs weren't actually usable. It wasn't until 2010 that ICANN approved the first set of IDNs. However, even today if you want to use an IDN, you pretty much need to completely destroy your domain name using Punycode if you actually want things like email and DNS to work globally.
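To make the Punycode point concrete, here is a small sketch that converts the domain part of an address to its ASCII form before validating it. The `toAsciiAddress` helper is an assumption for illustration, and it relies on the WHATWG `URL` parser (available in modern browsers and Node.js) to do the IDN-to-ASCII conversion.

```javascript
// HTML5 regex from the answer, with "/" escaped inside the character class.
const html5Email = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Hypothetical helper: rewrites only the domain part into Punycode by
// letting the URL parser normalise the host name.
function toAsciiAddress(address) {
  const at = address.lastIndexOf("@");
  if (at < 0) {
    throw new Error("address has no @ sign");
  }
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);
  const asciiDomain = new URL("http://" + domain).hostname;
  return local + "@" + asciiDomain;
}

const unicodeAddress = "user@español.com";
console.log(html5Email.test(unicodeAddress)); // false: "ñ" is outside the allowed set

const asciiAddress = toAsciiAddress(unicodeAddress);
console.log(asciiAddress);                    // user@xn--espaol-zwa.com
console.log(html5Email.test(asciiAddress));   // true
```

Only the domain is converted here; a non-ASCII local part would still be rejected by the regex.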

UPDATE 2:

Updated HTML5 regex to match the updated spec, which changed label length limits from 255 characters to 63 characters, as specified in RFC 1034 section 3.5.
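The 63-character label limit can be checked directly against the pattern. This is only a quick sketch; the `label63` and `label64` names are made up for the example.

```javascript
// HTML5 regex from the answer, with "/" escaped inside the character class.
const html5Email = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// Each label is one leading character, up to 61 middle characters,
// and one trailing character: 1 + 61 + 1 = 63 at most.
const label63 = "a".repeat(63);
const label64 = "a".repeat(64);

console.log(html5Email.test("user@" + label63 + ".com")); // true
console.log(html5Email.test("user@" + label64 + ".com")); // false
```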
