Regex to match Domain.CCTLD

Question

Does anyone know a regular expression to match Domain.CCTLD? I don't want subdomains, only the "atomic domain". For example, docs.google.com doesn't get matched, but google.com does. However, this gets complicated with country-code suffixes like .co.uk. Does anyone know a solution? Thanks in advance.

EDIT: I've realized I also have to deal with multiple subdomains, like john.doe.google.co.uk. Need a solution now more than ever :P.

Solution

Based on your comment above, I'm going to reinterpret the question -- rather than making a regex that will match them, we'll create a function that will match them, and apply that function to filter a list of domain names to only include first class domains, e.g. google.com, amazon.co.uk.

First, we'll need a list of TLDs. As Greg mentioned, the Public Suffix List is a great place to start. Let's assume you've parsed the list into a Python list called suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.

suffixes = parse_suffix_list("suffix_list.txt")
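For reference, here is a minimal sketch of what such a parser might look like. It assumes a locally saved copy of the Public Suffix List, where comment lines start with // and each rule ends at the first whitespace. The helper parse_suffix_lines is a hypothetical name introduced here, and skipping wildcard and exception rules is a simplification, not how a complete implementation would treat them.

```python
def parse_suffix_lines(lines):
    """Collect plain suffix rules from the lines of a Public Suffix List file."""
    suffixes = []
    for line in lines:
        line = line.strip()
        # Blank lines and "//" comment lines carry no rules
        if not line or line.startswith("//"):
            continue
        # A rule ends at the first whitespace
        rule = line.split()[0].lower()
        # Simplification (assumption): skip wildcard ("*.ck") and
        # exception ("!www.ck") rules instead of interpreting them
        if rule.startswith("*.") or rule.startswith("!"):
            continue
        suffixes.append(rule)
    return suffixes


def parse_suffix_list(path):
    """Read the rules from a local file; the path is supplied by the caller."""
    with open(path, encoding="utf-8") as f:
        return parse_suffix_lines(f)
```

Splitting the work this way keeps the line handling separate from file access, so it can be tried on a short in-memory list before pointing it at the real file.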

Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:

def is_domain(d):
    for suffix in suffixes:
        # Require the dot so that "foocom" does not match the suffix "com"
        if d.endswith('.' + suffix):
            # Get the base domain name without the suffix and its dot
            base_name = d[:-(len(suffix) + 1)]
            # If it contains '.', it's a subdomain.
            if '.' not in base_name:
                return True
    # If we get here, no matches were found
    return False
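The edit to the question also asks about names with several subdomain levels, such as john.doe.google.co.uk. One possible variation, sketched below, walks the labels from the left and stops at the longest suffix found in the list, which also avoids treating a name like co.uk (itself a suffix) as a domain. The names registrable_domain and is_registrable are hypothetical, and the three-entry suffix list is only sample data for illustration.

```python
def registrable_domain(host, suffixes):
    """Return one label plus the longest matching suffix, or None if there is none."""
    suffix_set = set(suffixes)
    labels = host.lower().rstrip(".").split(".")
    # Longest candidate first: for a.b.co.uk check
    # "a.b.co.uk", "b.co.uk", "co.uk", then "uk"
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffix_set:
            # i == 0 means the whole host is itself a suffix (e.g. "co.uk")
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None


def is_registrable(host, suffixes):
    """True when host has exactly one label in front of its suffix."""
    return registrable_domain(host, suffixes) == host.lower().rstrip(".")


# Sample data only; a real list would come from the Public Suffix List
suffixes = ["com", "uk", "co.uk"]
hosts = ["google.com", "docs.google.com", "amazon.co.uk",
         "john.doe.google.co.uk", "co.uk"]
first_level = [h for h in hosts if is_registrable(h, suffixes)]
print(first_level)  # ['google.com', 'amazon.co.uk']
```

Passing the suffix list as an argument rather than reading a global makes the functions easier to test, and registrable_domain can also be used on its own to reduce any host name to its base domain.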
