Python urlparse -- extract domain name without subdomain


Problem description

Need a way to extract a domain name without the subdomain from a url using Python urlparse.

For example, I would like to extract "google.com" from a full url like "http://www.google.com".

The closest I can seem to come with urlparse is the netloc attribute, but that includes the subdomain, which in this example would be www.google.com.
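To make the gap concrete, here is a minimal sketch assuming Python 3, where urlparse lives in urllib.parse. The helper names host_of and naive_last_two_labels are hypothetical, made up for this illustration. It shows what netloc returns and why simply keeping the last two labels is not a safe shortcut:

```python
from urllib.parse import urlparse  # Python 3 location of urlparse


def host_of(url):
    """Return the lowercased host of a URL, without port or credentials."""
    return urlparse(url).hostname


def naive_last_two_labels(url):
    """Hypothetical shortcut: keep only the last two dot-separated labels."""
    return ".".join(host_of(url).split(".")[-2:])


print(urlparse("http://www.google.com").netloc)         # www.google.com
print(naive_last_two_labels("http://www.google.com"))   # google.com
print(naive_last_two_labels("http://www.bbc.co.uk"))    # co.uk (wrong: expected bbc.co.uk)
```

The last line is the kind of edge case that makes hand-written string handling risky here.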

I know that it is possible to write some custom string manipulation to turn www.google.com into google.com, but I want to avoid by-hand string transforms or regex in this task. (The reason for this is that I am not familiar enough with url formation rules to feel confident that I could consider every edge case required in writing a custom parsing function.)

Or, if urlparse can't do what I need, does anyone know any other Python url-parsing libraries that would?

Solution

You probably want to check out tldextract, a library designed to do this kind of thing.

It uses the Public Suffix List to try to get a decent split based on known public suffixes (gTLDs, ccTLDs, and multi-label suffixes such as co.uk). Do note that this is just a brute-force list, nothing special, so it can get out of date (although hopefully it's curated so as not to).

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

So in your case:

>>> extracted = tldextract.extract('http://www.google.com')
>>> "{}.{}".format(extracted.domain, extracted.suffix)
'google.com'
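For intuition about what a suffix-list lookup does, here is a rough sketch that uses only the standard library. It is not tldextract's implementation: TOY_SUFFIXES is a tiny made-up stand-in for the real Public Suffix List, registered_domain is a hypothetical helper name, and the list's wildcard and exception rules are ignored.

```python
from urllib.parse import urlparse

# Toy stand-in for the Public Suffix List (assumption: the real list is far larger).
TOY_SUFFIXES = {"com", "org", "uk", "co.uk"}


def registered_domain(url, suffixes=TOY_SUFFIXES):
    """Return 'domain.suffix' for a URL, or None if no known suffix applies."""
    host = urlparse(url).hostname
    if not host or host in suffixes:
        # No host at all, or the host is itself a bare public suffix.
        return None
    labels = host.split(".")
    # Check the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            # Keep the one label just before the matched suffix.
            return ".".join(labels[i - 1:])
    return None


print(registered_domain("http://www.google.com"))        # google.com
print(registered_domain("http://forums.news.cnn.com/"))  # cnn.com
print(registered_domain("http://www.bbc.co.uk"))         # bbc.co.uk
```

Everything hinges on how complete the suffix set is, which is why a maintained list (and a library that ships it) is preferable to a hand-rolled one.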
