Properly Matching an IDN URL
Problem description
I need help building a regular expression that can properly match a URL inside free text.
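- scheme
  - one of the following: ftp, http, https (is ftps a protocol?)
- optional user (and optional pass)
- host (with support for IDNs)
  - support for www and sub-domain(s) (with support for IDNs)
  - basic filtering of TLDs ([a-zA-Z]{2,6} is enough, I think)
- optional port number
- path (optional, with support for Unicode characters)
- query (optional, with support for Unicode characters)
- fragment (optional, with support for Unicode characters)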
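The requirement list above can be sketched as one verbose pattern with a named group per component. This is only a rough illustration in Python, not a complete URL grammar: the name `URL_RE` and the group names are made up for this example, and it relies on Python 3's `\w` being Unicode-aware for `str` patterns. Because the TLD filter is the ASCII-only `[a-zA-Z]{2,6}` from the list, internationalized TLDs would not match.

```python
import re

# Hypothetical pattern assembled from the requirement list; names are illustrative.
URL_RE = re.compile(
    r"""
    \b(?P<scheme>https?|ftp)://                             # scheme: http, https or ftp
    (?:(?P<user>[^\s:@/]+)(?::(?P<password>[^\s@/]*))?@)?   # optional user and optional pass
    (?P<host>(?:[^\W_](?:[\w-]*[^\W_])?\.)+[a-zA-Z]{2,6})   # Unicode labels, then a basic TLD
    (?::(?P<port>\d{1,5}))?                                 # optional port number
    (?P<path>/[^\s?\#]*)?                                   # optional path
    (?:\?(?P<query>[^\s\#]*))?                              # optional query
    (?:\#(?P<fragment>\S*))?                                # optional fragment
    """,
    re.VERBOSE,
)

text = "See http://user:pw@bücher.example.com:8080/päth?q=ü#frag now"
match = URL_RE.search(text)
if match:
    # Print each component that was found in the sample text.
    for name, value in match.groupdict().items():
        print(f"{name}: {value}")
```

Each label must start and end with a letter or digit (`[^\W_]`) and may contain hyphens in between, which mirrors the non-IDN label rule while letting non-ASCII letters through.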
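Here is what I could find out about sub-domains:
A "subdomain" expresses relative dependence, not absolute dependence: for example, wikipedia.org comprises a subdomain of the org domain, and en.wikipedia.org comprises a subdomain of the domain wikipedia.org. In theory, this subdivision can go down to 127 levels deep, and each DNS label can contain up to 63 characters, as long as the whole domain name does not exceed a total length of 255 characters.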
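The limits in that quote are awkward to express inside a regex, so one option is to check them separately after a candidate host has been matched. The sketch below is an assumption-laden illustration in Python: the function name `within_dns_limits` and the constants are invented here, the figures are taken from the quote as-is, and lengths are measured on the ASCII (Punycode) form produced by Python's built-in `idna` codec, which implements the older IDNA 2003 rules.

```python
MAX_LABEL_LEN = 63   # per-label limit, as quoted above
MAX_NAME_LEN = 255   # whole-name limit, as quoted above
MAX_LEVELS = 127     # nesting depth, as quoted above


def within_dns_limits(hostname: str) -> bool:
    """Return True if hostname fits the quoted limits (hypothetical helper)."""
    labels = hostname.rstrip(".").split(".")
    if not 1 <= len(labels) <= MAX_LEVELS:
        return False
    try:
        # Convert every label to its ASCII form, e.g. a label with umlauts
        # becomes an "xn--..." Punycode label.
        ascii_labels = [label.encode("idna").decode("ascii") for label in labels]
    except UnicodeError:
        # The codec rejects labels it cannot convert, including over-long ones.
        return False
    if any(not 1 <= len(label) <= MAX_LABEL_LEN for label in ascii_labels):
        return False
    return len(".".join(ascii_labels)) <= MAX_NAME_LEN


for name in ("bücher.example.com", "a" * 64 + ".com", "a..com"):
    print(name[:20], within_dns_limits(name))
```

Measuring the converted form matters because a short Unicode label can grow past 63 characters once it is encoded.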
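Regarding the domain name itself, I couldn't find any reliable source, but I think the regular expression for non-IDNs (I'm not sure how to write an IDN-compatible version) is something like:
[0-9a-zA-Z][0-9a-zA-Z\-]{2,62}
Can someone help me out with this regular expression or point me in a good direction?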
John Gruber, of Daring Fireball fame, recently published a post detailing his quest for a good URL-recognizing regex string. What he came up with was this:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
This apparently copes with Unicode-containing URLs as well. You would need to modify it slightly to get the rest of what you're looking for: the scheme, username, password, etc. Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex is so write-once-have-no-clue-how-to-read-ever-again!).
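To try Gruber's pattern from Python, one detail needs adapting: the POSIX class `[:punct:]` is not supported by Python's `re` module. The sketch below swaps in ASCII punctuation from `string.punctuation`, which is an assumption on my part rather than an exact equivalent (Unicode punctuation is not covered). The names `PUNCT`, `GRUBER_RE` and `find_urls` are made up for this example.

```python
import re
import string

# ASCII punctuation stands in for POSIX [:punct:], which Python's re lacks.
PUNCT = re.escape(string.punctuation)

GRUBER_RE = re.compile(
    r"\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^" + PUNCT + r"\s]|/)))"
)


def find_urls(text: str) -> list:
    """Return every substring of text that the adapted pattern matches."""
    return [m.group(0) for m in GRUBER_RE.finditer(text)]


sample = "Go to http://bücher.example/päth, then www.example.com."
print(find_urls(sample))
```

The final alternation is what keeps trailing commas and full stops out of the match: a URL has to end in a balanced parenthesised group, a non-punctuation character, or a slash.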