Python urlparse -- extract domain name without subdomain


Question

I need a way to extract the domain name, without the subdomain, from a URL using Python's urlparse.

For example, I would like to extract "google.com" from a full URL like "http://www.google.com".

The closest I can get with urlparse is the netloc attribute, but that includes the subdomain, which in this example would be www.google.com.
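To make the problem concrete, here is a minimal sketch using only the standard library. It assumes Python 3, where urlparse lives in urllib.parse (in Python 2 it was the top-level urlparse module); the variable names are illustrative only.

```python
from urllib.parse import urlparse

parsed = urlparse("http://www.google.com")

# netloc is the raw network location: the subdomain is still attached,
# and a port or credentials would be included too if the URL had them.
host = parsed.netloc

# hostname is the same host, lowercased and without port or credentials.
hostname = parsed.hostname

print(host)      # www.google.com
print(hostname)  # www.google.com
```

Neither attribute knows where the subdomain ends and the registered domain begins, which is why urlparse alone is not enough here.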

I know it is possible to write some custom string manipulation to turn www.google.com into google.com, but I want to avoid by-hand string transforms or regex for this task. (The reason is that I am not familiar enough with URL formation rules to feel confident that I could cover every edge case a custom parsing function would need to handle.)

Or, if urlparse can't do what I need, does anyone know of another Python URL-parsing library that would?

Recommended answer

You probably want to check out tldextract, a library designed to do this kind of thing.

It uses the Public Suffix List to try to get a decent split based on known suffixes (gTLDs, country-code suffixes such as co.uk, and so on), but note that this is just a brute-force list, nothing special, so it can get out of date (although hopefully it is curated so as not to).
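The list-based idea can be sketched roughly as follows. This is not tldextract's implementation: SAMPLE_SUFFIXES is a tiny hypothetical stand-in for the real Public Suffix List, registered_domain is an illustrative helper name, and the wildcard and exception rules of the real list are ignored. Python 3 is assumed.

```python
from urllib.parse import urlparse

# Hypothetical, tiny stand-in for the real Public Suffix List (illustration only).
SAMPLE_SUFFIXES = {"com", "org", "uk", "co.uk"}


def registered_domain(url, suffixes=SAMPLE_SUFFIXES):
    """Return 'domain.suffix' for url, or None if no known suffix matches."""
    host = urlparse(url).hostname
    if not host:
        return None
    labels = host.split(".")
    # Try the longest candidate suffix first so that 'co.uk' wins over 'uk'.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # Keep exactly one label in front of the matched suffix.
            return ".".join(labels[i - 1:])
    return None


print(registered_domain("http://www.google.com"))        # google.com
print(registered_domain("http://forums.news.cnn.com/"))  # cnn.com
print(registered_domain("http://www.bbc.co.uk"))         # bbc.co.uk
```

The sketch shows why a plain "take the last two labels" rule fails for hosts like www.bbc.co.uk, and why the result is only as good as the suffix list behind it.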

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

So in your case:

>>> extracted = tldextract.extract('http://www.google.com')
>>> "{}.{}".format(extracted.domain, extracted.suffix)
'google.com'
