Get Root Domain of Link
Question
I have a link such as http://www.techcrunch.com/ and I would like to get just the techcrunch.com part of the link. How do I go about this in python?
Recommended answer
Use urlparse:
import urllib.parse as urlparse  # Python 3; on Python 2, use "import urlparse" instead
hostname = urlparse.urlparse("http://www.techcrunch.com/").hostname
Getting the "root domain", however, is going to be more problematic, because it isn't defined in a syntactic sense. What's the root domain of "www.theregister.co.uk"? How about networks using default domains? "devbox12" could be a valid hostname.
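To make the ambiguity concrete, here is a minimal sketch that prints the hostname and its dot-separated labels for each case mentioned above. It assumes Python 3, where `urllib.parse` replaces the Python 2 `urlparse` module, and the `http://devbox12/` URL is an assumed example of a bare intranet hostname, not something from the question.

```python
from urllib.parse import urlparse

urls = [
    "http://www.techcrunch.com/",
    "http://www.theregister.co.uk/",
    "http://devbox12/",  # assumed example of a host that relies on a default domain
]

# Extract only the hostname; nothing here says where the "root" starts.
hostnames = [urlparse(u).hostname for u in urls]

for h in hostnames:
    # The label count alone (3, 4, 1) does not reveal the registrable part.
    print(h, h.split("."))
```

The parser returns the full hostname in every case, so any notion of a root domain has to come from extra rules applied afterwards.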
For the most common cases, however, you can probably handle the former specially and ignore the latter, but be aware that it won't be 100% accurate.
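parts = urlparse.urlparse(url).hostname.split(".")
# Keep three labels when the next-to-last one is short (e.g. "co" in ".co.uk"), otherwise two.
hostname = ".".join(parts[-3:] if len(parts) > 2 and len(parts[-2]) < 4 else parts[-2:])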
This uses the last three parts if the next-to-last part is less than four characters (e.g. ".com.au", ".co.uk") and the last two parts otherwise.
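The heuristic can be wrapped in a small function, sketched here for Python 3. The name `root_domain` and the `None` return for URLs without a hostname are illustrative choices, not part of the original answer.

```python
from urllib.parse import urlparse


def root_domain(url):
    """Heuristic guess at the root domain; not a real public-suffix lookup."""
    hostname = urlparse(url).hostname
    if hostname is None:
        # urlparse finds no hostname when the scheme is missing, e.g. "techcrunch.com".
        return None
    parts = hostname.split(".")
    # A short next-to-last label ("co", "com") suggests a two-part suffix like ".co.uk".
    if len(parts) > 2 and len(parts[-2]) < 4:
        return ".".join(parts[-3:])
    return ".".join(parts[-2:])


print(root_domain("http://www.techcrunch.com/"))
print(root_domain("http://www.theregister.co.uk/"))
print(root_domain("http://devbox12/"))
```

The length test also misfires on short registered names: `http://www.bbc.com/` comes back as `www.bbc.com`, which illustrates the answer's warning that this is not fully accurate.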