grep valid domain regex


Problem description





I'm trying to write a regex for grep that matches only valid domains.

My version works pretty well, but it also matches the following invalid domain:

@subdom..dom.ext

Here is my regex:

echo "@dom.ext" | grep "^@[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+\$"

I'm working with bash so I escaped special characters.

Samples that should match:

@subdom.dom.ext
@subsubdom.subdom.dom.ext
@subsub-dom.sub-dom.ext

Thanks for help

Solution

A truly complete solution takes more work, but here's a shot (note that a @ prefix is assumed):

^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)*[a-zA-Z](-?[a-zA-Z0-9])+\.[a-zA-Z]{2,}$

You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash's regex-matching operator.
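As a rough illustration of the grep -E usage, the sketch below filters candidate names read from standard input. The variable name domain_re and the helper filter_domains are made up for this example; only the regex itself comes from the answer.

```bash
#!/usr/bin/env bash
# The regex from the answer, stored in a variable so it can be reused.
domain_re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)*[a-zA-Z](-?[a-zA-Z0-9])+\.[a-zA-Z]{2,}$'

# filter_domains (hypothetical helper): reads one candidate per line on stdin
# and prints only the lines that match the regex.
filter_domains() {
  grep -E -- "$domain_re"
}

printf '%s\n' '@dom.ext' '@subdom..dom.ext' '@subsub-dom.sub-dom.ext' | filter_domains
# prints:
# @dom.ext
# @subsub-dom.sub-dom.ext
```

Single-quoting the pattern keeps bash from interpreting any of its characters, so no extra backslash escaping is needed.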

The regex makes the following simplifying assumptions, which do not exactly mirror actual DNS name constraints (most are more permissive, a few are stricter):

  • Only ASCII (non-foreign) letters are allowed - see below for Internationalized Domain Name (IDN) considerations; also, the ASCII forms of IDNs - e.g., xn--bcher-kva.ch for bücher.ch - are not matched (though it would be easy to fix that).
  • There's no limit on the number of nested subdomains.
  • There's no limit on the length of any label (name component), and no limit on the overall length of the name (DNS itself limits each label to 63 octets and the full name to 255 octets).
  • The TLD (last component) is composed of letters only and has a length of at least 2.
  • Both subdomain and domain names must start with a letter; the domain name must have a length of at least 2; subdomains are allowed to be single-letter.
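The first assumption notes that ASCII forms of IDNs such as xn--bcher-kva.ch are not matched, because the regex allows at most one hyphen between alphanumeric characters. One possible way to relax this, sketched below, is to replace -? with -* so that runs of hyphens are accepted inside a label. This is an assumption about what such a fix could look like, not the answer's own fix; ace_re and matches_ace are hypothetical names.

```bash
#!/usr/bin/env bash
# Variant of the regex with -? replaced by -* (an assumed fix, see lead-in).
ace_re='^@(([a-zA-Z](-*[a-zA-Z0-9])*)\.)*[a-zA-Z](-*[a-zA-Z0-9])+\.[a-zA-Z]{2,}$'

# matches_ace (hypothetical helper): succeeds if $1 matches the relaxed regex.
matches_ace() {
  [[ $1 =~ $ace_re ]]
}

for d in @xn--bcher-kva.ch @sub-.dom.ext @-sub.dom.ext; do
  if matches_ace "$d"; then echo "$d YES"; else echo "$d NO"; fi
done
# prints:
# @xn--bcher-kva.ch YES
# @sub-.dom.ext NO
# @-sub.dom.ext NO
```

The trade-off is that any label with consecutive hyphens (for example a--b) is now accepted as well, and a TLD in xn-- form is still rejected because the last component only accepts letters.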

Here's a quick test:

for d in @subdom..dom.ext @dom.ext @subdom.dom.ext @subsubdom.subdom.dom.ext @subsub-dom.sub-dom.ext; do
 [[ $d =~ \
    ^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)*[a-zA-Z](-?[a-zA-Z0-9])+\.[a-zA-Z]{2,}$ \
 ]] && echo YES || echo NO
done
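The same test can also be written with the regex stored in a variable, which avoids the line continuations and is a common bash idiom. The sketch below assumes the names re and check_domains, which are not part of the original answer. Note that the variable must stay unquoted on the right-hand side of =~, since bash treats a quoted right-hand side as a literal string.

```bash
#!/usr/bin/env bash
re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)*[a-zA-Z](-?[a-zA-Z0-9])+\.[a-zA-Z]{2,}$'

# check_domains (hypothetical helper): prints each argument followed by
# YES or NO depending on whether it matches the regex.
check_domains() {
  local d
  for d in "$@"; do
    if [[ $d =~ $re ]]; then   # $re is deliberately unquoted
      printf '%s YES\n' "$d"
    else
      printf '%s NO\n' "$d"
    fi
  done
}

check_domains @subdom..dom.ext @dom.ext @subdom.dom.ext \
  @subsubdom.subdom.dom.ext @subsub-dom.sub-dom.ext
# prints:
# @subdom..dom.ext NO
# @dom.ext YES
# @subdom.dom.ext YES
# @subsubdom.subdom.dom.ext YES
# @subsub-dom.sub-dom.ext YES
```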


Support for Internationalized Domain Names (IDN):

A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:

^@(([[:alpha:]](-?[[:alnum:]])*)\.)*[[:alpha:]](-?[[:alnum:]])+\.[[:alpha:]]{2,}$
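A small sketch for trying the IDN variant on a given system follows. The names idn_re and matches_idn are hypothetical. Whether a non-ASCII name such as @bücher.ch matches depends on the active locale and on the platform's character-class tables, so no result is claimed for it here.

```bash
#!/usr/bin/env bash
idn_re='^@(([[:alpha:]](-?[[:alnum:]])*)\.)*[[:alpha:]](-?[[:alnum:]])+\.[[:alpha:]]{2,}$'

# matches_idn (hypothetical helper): succeeds if $1 matches the IDN variant.
matches_idn() {
  [[ $1 =~ $idn_re ]]
}

# The outcome for the non-ASCII name depends on the locale in effect
# (for example a UTF-8 locale versus C/POSIX).
for d in @dom.ext @bücher.ch; do
  if matches_idn "$d"; then echo "$d YES"; else echo "$d NO"; fi
done
```

Running the script once under a UTF-8 locale and once with LC_ALL=C is a quick way to see how much the result depends on the environment.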

Caveats:

  • Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.

  • I'm unclear on whether names in right-to-left writing scripts are properly matched.

  • For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
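For plain ASCII names, where the text being matched is what DNS actually sees, the length limits can be checked alongside the regex. The sketch below is one way to do that under stated assumptions: is_valid_domain is a hypothetical helper, and the limits used (63 characters per label, 253 characters for the whole name in text form, which corresponds to the 255-octet limit on the wire) come from the DNS specifications rather than from the answer. It does not handle IDNs, which would first have to be converted to their Punycode form.

```bash
#!/usr/bin/env bash
domain_re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)*[a-zA-Z](-?[a-zA-Z0-9])+\.[a-zA-Z]{2,}$'

# is_valid_domain (hypothetical helper): succeeds if $1 matches the regex
# and also respects the assumed DNS length limits for ASCII names.
is_valid_domain() {
  local name label
  local -a labels
  [[ $1 =~ $domain_re ]] || return 1
  name=${1#@}                          # drop the assumed @ prefix
  (( ${#name} <= 253 )) || return 1    # overall length in text form
  IFS=. read -r -a labels <<< "$name"  # split the name into labels
  for label in "${labels[@]}"; do
    (( ${#label} <= 63 )) || return 1  # per-label limit
  done
  return 0
}

long_label=$(printf 'a%.0s' {1..64})   # a label of 64 characters
is_valid_domain "@dom.ext" && echo YES || echo NO              # prints YES
is_valid_domain "@${long_label}.com" && echo YES || echo NO    # prints NO
```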

Tip of the hat to @Alfe for pointing out the problem with IDNs.
