Python urlparse -- extract domain name without subdomain

Views: 26 Published: 2021/12/12 23:59:50 Tags: python parsing url urlparse

Problem description

I need a way to extract a domain name without the subdomain from a URL using Python's urlparse.

For example, I would like to extract "google.com" from a full url like "http://www.google.com".

The closest I can seem to come with urlparse is the netloc attribute, but that includes the subdomain, which in this example would be www.google.com.
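For reference, here is a minimal sketch of what urlparse returns for the example URL. It assumes Python 3, where the function lives in urllib.parse (in Python 2 it was the urlparse module); the variable names are only illustrative.

```python
from urllib.parse import urlparse

parsed = urlparse("http://www.google.com")

# netloc is the full network location, subdomain included
netloc = parsed.netloc
# hostname is the lowercased host with any port or credentials stripped
hostname = parsed.hostname

print(netloc)    # www.google.com
print(hostname)  # www.google.com
```

Neither attribute knows where the subdomain ends and the registered domain begins, which is the gap this question is about.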

I know that it is possible to write some custom string manipulation to turn www.google.com into google.com, but I want to avoid by-hand string transforms or regex in this task. (The reason for this is that I am not familiar enough with url formation rules to feel confident that I could consider every edge case required in writing a custom parsing function.)
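To illustrate the kind of edge case meant here, this is a sketch of a deliberately naive approach that keeps the last two labels of the hostname. The function name naive_domain is hypothetical and exists only for this example.

```python
from urllib.parse import urlparse


def naive_domain(url):
    """Keep only the last two dot-separated labels of the hostname."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


print(naive_domain("http://www.google.com"))  # google.com
print(naive_domain("http://www.bbc.co.uk"))   # co.uk (wrong: bbc.co.uk was wanted)
```

The second result shows why counting labels is not enough: some suffixes, such as co.uk, span more than one label.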

Or, if urlparse can't do what I need, does anyone know any other Python url-parsing libraries that would?

Solution

You probably want to check out tldextract, a library designed to do this kind of thing.

It uses the Public Suffix List to split a hostname at the longest known public suffix, which covers not only gTLDs such as com but also country-code and multi-label suffixes such as co.uk. Do note that this is just a lookup list, nothing special, so it can get out of date (although hopefully it's curated so as not to).
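The idea behind a suffix-list lookup can be sketched roughly as below. This is not tldextract's implementation: SAMPLE_SUFFIXES is a tiny hand-picked set assumed for illustration, split_host is a hypothetical helper, and the wildcard and exception rules of the real Public Suffix List are not handled.

```python
from urllib.parse import urlparse

# Tiny hand-picked sample for illustration only; the real Public Suffix List
# has thousands of entries plus wildcard and exception rules not handled here.
SAMPLE_SUFFIXES = {"com", "uk", "co.uk"}


def split_host(url, suffixes=SAMPLE_SUFFIXES):
    """Return (subdomain, domain, suffix) using the longest matching suffix."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Try the longest candidate suffix first. Starting at index 1 keeps at
    # least one label in front of the suffix to serve as the domain.
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[:i - 1]), labels[i - 1], candidate
    # No known suffix matched: treat the whole host as the domain
    return "", host, ""


print(split_host("http://forums.news.cnn.com/"))  # ('forums.news', 'cnn', 'com')
print(split_host("http://www.bbc.co.uk"))         # ('www', 'bbc', 'co.uk')
```

Because the result depends entirely on what is in the list, a stale list gives wrong splits, which is the caveat above.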

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

So in your case:

>>> extracted = tldextract.extract('http://www.google.com')
>>> "{}.{}".format(extracted.domain, extracted.suffix)
'google.com'
