Properly Matching an IDN URL


Question

I need help building a regular expression that can properly match a URL inside free text.

  • scheme
    • One of the following: ftp, http, https (is ftps a protocol?)
  • optional user (and optional pass)
  • host (with support for IDNs)
    • support for www and sub-domain(s) (with support for IDNs)
    • basic filtering of TLDs ([a-zA-Z]{2,6} is enough I think)
  • optional port number
  • path (optional, with support for Unicode chars)
  • query (optional, with support for Unicode chars)
  • fragment (optional, with support for Unicode chars)

Here is what I could find out about sub-domains:

A "subdomain" expresses relative dependence, not absolute dependence: for example, wikipedia.org comprises a subdomain of the org domain, and en.wikipedia.org comprises a subdomain of the domain wikipedia.org. In theory, this subdivision can go down to 127 levels deep, and each DNS label can contain up to 63 characters, as long as the whole domain name does not exceed a total length of 255 characters.

Regarding the domain name itself, I couldn't find any reliable source, but I think the regular expression for non-IDNs (I'm not sure how to write an IDN-compatible version) is something like:

[0-9a-zA-Z][0-9a-zA-Z\-]{2,62}

Can someone help me out with this regular expression or point me to a good direction?
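
The length limits in the quoted passage apply to the ASCII form of a name, so one way to handle IDNs is to convert the host first and check the labels afterwards, rather than trying to express everything in a single regex. The following is a minimal sketch of that idea in Python. The function name `is_plausible_hostname` and the two constants are illustrative assumptions, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so results may differ from newer registries' rules.

```python
import re

MAX_LABEL_LENGTH = 63   # per-label limit from the quoted passage
MAX_NAME_LENGTH = 255   # whole-name limit from the quoted passage

# ASCII label: letters, digits and hyphens, with no leading or trailing hyphen.
_LABEL_RE = re.compile(r"(?!-)[0-9A-Za-z-]+(?<!-)")


def is_plausible_hostname(host: str) -> bool:
    """Convert a (possibly internationalized) host to ASCII and check its labels."""
    try:
        # The built-in codec turns "bücher.example" into "xn--bcher-kva.example".
        ascii_host = host.encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty or over-long labels and for disallowed characters.
        return False
    if not ascii_host or len(ascii_host) > MAX_NAME_LENGTH:
        return False
    labels = ascii_host.rstrip(".").split(".")
    return all(
        len(label) <= MAX_LABEL_LENGTH and _LABEL_RE.fullmatch(label) is not None
        for label in labels
    )
```

With this approach the regex only has to find a candidate host in the text; the stricter per-label rules are applied to the converted name.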

Solution

John Gruber, of Daring Fireball fame, had a post recently that detailed his quest for a good URL-recognizing regex string. What he came up with was this:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

Which apparently does OK with Unicode-containing URLs, as well. You'd need to modify it slightly to get the rest of what you're looking for: the scheme, username, password, etc. Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex is so write-once-have-no-clue-how-to-read-ever-again!).
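
As a rough illustration of what such a modification might look like, here is a sketch in Python that captures the parts listed in the question as named groups. It is not Gruber's pattern: Python's `re` module does not support POSIX classes such as `[:punct:]`, so this version is written from scratch. The group names, the `find_urls` helper and the exact character classes are all assumptions. It relies on `\w` being Unicode-aware for `str` patterns in Python 3, which is what lets IDN hosts and non-ASCII paths match.

```python
import re

# Illustrative pattern only: group names and character classes are assumptions.
URL_RE = re.compile(
    r"""
    \b(?P<scheme>https?|ftp)://                            # scheme
    (?:(?P<user>[^\s:@/]+)(?::(?P<password>[^\s@/]*))?@)?  # optional user and password
    (?P<host>
        (?:[^\W_](?:(?:[^\W_]|-){0,61}[^\W_])?\.)+         # labels of up to 63 letters, digits, hyphens
        [^\W\d_]{2,}                                       # TLD: two or more letters
    )
    (?::(?P<port>\d{1,5}))?                                # optional port
    (?P<path>/[^\s?#<>]*)?                                 # optional path
    (?:\?(?P<query>[^\s#<>]*))?                            # optional query
    (?:\#(?P<fragment>[^\s<>]*))?                          # optional fragment
    """,
    re.VERBOSE,
)


def find_urls(text: str) -> list[dict]:
    """Return one dict of named parts per URL found in the text."""
    return [match.groupdict() for match in URL_RE.finditer(text)]


if __name__ == "__main__":
    sample = "see http://user:pw@bücher.example:8080/päth?q=ü#frag now"
    for parts in find_urls(sample):
        print(parts)
```

Unlike Gruber's pattern, this sketch requires an explicit scheme and does not trim trailing punctuation, so a URL at the end of a sentence would pick up the final period as part of its path.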
