regex - extract domain name and TLD


Question

I'm trying to extract the domain name and the TLD (if it exists) from a string.

For "testing.co.uk" I want an array with the values: ("testing", "co.uk")

For "-testing.c" I want an array with a single value: ("testing")

For "test-ing.co.uk.com" I want an array with the values: ("test-ing", "co.uk")

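The rules are simple:

  • The first and last characters of the domain name cannot be "-"
  • The TLD must have at least two characters
  • The TLD part may contain a dot ".", but only if the dot is followed by at least two letters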

I have this:

  • (\w[-\w]*\w) - first part, which extracts the domain name (working)
  • \.(\w{2,}(\.?\w{2,})?) - gets the TLD (not working)

Recommended answer

This works if we can make the following assumptions: the TLD is at most 2 subsections long and sits at the end of the string (the last subsection is always part of the TLD); the middle subsection is between 2 and 3 chars long; and there is at least one subsection in the string that is not part of the TLD. Under those assumptions the following should match most cases. Your assumption that domains must be alphanumeric, with dashes allowed only in the middle, is correct. Each segment can be at most 63 chars long.

^((?:www\.)?(?:\w[-\w]{0,61}\w|\w)(?:\.(?:\w[-\w]{0,61}\w|\w))*?)\.((?:\w{2,3}\.)?\w+)$

Explanation:

(?: ) is a non-capturing group: you can use +, * or ? on it, but it won't be returned in your result

^ and $ match the start and end of the string respectively

{n,m} is like * or + but matches a specific number of repetitions (at least n, at most m)

*? means match 0 or more times, but non-greedily, so it matches the fewest times required for a valid match. This means that subsections that could be matched by either side of the regexp will go into the TLD.
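The effect of the non-greedy quantifier can be checked with a small Python sketch. The names SEG, TLD, lazy and greedy are illustrative only, and the pattern is a simplified form of the answer's regex (without the www prefix, and with the counted quantifiers written in comma form). The only difference between the two compiled patterns is `*?` versus `*`.

```python
import re

# One domain label: up to 63 chars, no dash at either end, or a single char.
SEG = r"(?:\w[-\w]{0,61}\w|\w)"
# TLD: optional 2-3 char second-level part, then the final label.
TLD = r"((?:\w{2,3}\.)?\w+)"

# Same pattern twice; only the quantifier on the repeated labels differs.
lazy = re.compile(rf"^({SEG}(?:\.{SEG})*?)\.{TLD}$")
greedy = re.compile(rf"^({SEG}(?:\.{SEG})*)\.{TLD}$")

host = "blog.cat.com"
print(lazy.match(host).groups())    # ('blog', 'cat.com')
print(greedy.match(host).groups())  # ('blog.cat', 'com')
```

With the lazy form the ambiguous label "cat" ends up in the TLD group; with the greedy form it stays in the domain group.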

(?:www\.)? is a bugfix for short domain names such as www.un.org

(?:\w[-\w]{0,61}\w|\w) ensures that there is at least one subsection in the domain part and that each subsection is at most 63 chars long (61+2=63). The subsection is captured by the outer brackets. The |\w bit at the end handles the edge case of one-letter domain names such as x.org and i.net.

(?:\.(?:\w[-\w]{0,61}\w|\w))*? repeats the subsection pattern with a leading dot, because the first subsection cannot start with a dot. Zero or more of these are matched, and the search is non-greedy.

((?:\w{2,3}\.)?\w+) matches the TLD according to the rules above. The last subsection is always part of the TLD. The rules on what constitutes a second-level TLD are fuzzier.
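As a rough check of the pieces described above, here is a minimal Python sketch. It assumes the counted quantifiers are meant in comma form ({0,61}, {2,3}) and that the dot applies to both alternatives of the repeated subsection. DOMAIN_RE and split_domain are names chosen for this example, not part of the answer.

```python
import re

# Assumed reading of the answer's pattern: comma-form quantifiers,
# balanced parentheses, and the dot grouped with both label alternatives.
DOMAIN_RE = re.compile(
    r"^((?:www\.)?(?:\w[-\w]{0,61}\w|\w)(?:\.(?:\w[-\w]{0,61}\w|\w))*?)"
    r"\.((?:\w{2,3}\.)?\w+)$"
)

def split_domain(host):
    """Return (domain, tld), or None when the pattern does not match."""
    m = DOMAIN_RE.match(host)
    return m.groups() if m else None

for host in ["testing.co.uk", "www.un.org", "x.org", "-testing.c"]:
    print(host, "->", split_domain(host))
```

Under these assumptions "testing.co.uk" splits into ("testing", "co.uk"), "www.un.org" into ("www.un", "org"), and "x.org" into ("x", "org"). "-testing.c" does not match at all, because the pattern is anchored and a label cannot start with a dash.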

This regexp is not completely foolproof, as there are a few exceptions that violate the above rules. www.un.com is one example of a one-segment TLD with a short domain name. gmp.police.uk (Greater Manchester Police) is an example of a domain whose TLD (police.uk) will not be matched properly (it will match as uk).

I have expanded the length of TLD segments to {2,4}, as we need to include domains such as .info and .mod.uk. I have reduced the length of the second TLD segment to {2,3} in order to reduce the number of mismatches on four-letter domain names. There is not much we can do about two- or three-letter domain names, but they will only be mismatched if the domain also contains a subdomain, such as blog.cat.com

Here is a list of some of the TLDs already in use, which might highlight some of the edge cases. I don't think there are any
http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
http://en.wikipedia.org/wiki/.uk
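
Since the rules for second-level TLDs are fuzzy, one alternative to a pure regexp (not the method used in this answer) is to compare the trailing labels against an explicit list of suffixes. The sketch below uses a deliberately tiny, hand-picked set called KNOWN_SUFFIXES. Both the set and the function name split_by_suffix are hypothetical, and a real list would have to come from a maintained source such as the pages linked above.

```python
# Hypothetical, deliberately tiny suffix set, for illustration only.
KNOWN_SUFFIXES = {"uk", "co.uk", "police.uk", "com", "org", "net"}

def split_by_suffix(host):
    """Split host into (domain, suffix) using the longest known suffix."""
    labels = host.lower().split(".")
    # Start at index 1 so at least one label is left for the domain;
    # earlier indices give longer candidate suffixes, so they win.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES:
            return ".".join(labels[:i]), suffix
    return None

print(split_by_suffix("gmp.police.uk"))  # ('gmp', 'police.uk')
print(split_by_suffix("blog.cat.com"))   # ('blog.cat', 'com')
```

With such a list, the two problem cases mentioned earlier (gmp.police.uk and blog.cat.com) split as expected, at the cost of having to keep the list up to date.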
