规范化/规范化 URL? [英] Canonicalize / normalize a URL?
问题描述
我正在寻找一个库函数来规范 Python 中的 URL,即删除路径中的./"或../"部分,或添加默认端口或转义特殊字符等.结果应该是一个字符串,该字符串对于指向同一网页的两个 URL 来说是唯一的.例如 http://google.com
和 http://google.com:80/a/../
应返回相同的结果.
I am searching for a library function to normalize a URL in Python, that is to remove "./" or "../" parts in the path, or add a default port or escape special characters and so on. The result should be a string that is unique for two URLs pointing to the same web page. For example http://google.com
and http://google.com:80/a/../
shall return the same result.
我更喜欢 Python 3 并且已经浏览了 urllib
模块.它提供了拆分 URL 的功能,但没有将它们规范化.Java 有 URI.normalize()
函数可以做类似的事情(虽然它不认为默认端口 80 等于没有给定的端口),但是有没有类似 python 的东西?
I would prefer Python 3 and already looked through the urllib
module. It offers functions to split URLs but nothing to canonicalize them. Java has the URI.normalize()
function that does a similar thing (though it does not consider the default port 80 equal to no given port), but is there something like this is python?
推荐答案
遵循良好的开端,我编写了一个方法适合网络中常见的大多数情况.
Following the good start, I composed a method that fits most of the cases commonly found in the web.
def urlnorm(base, link=''):
'''Normalizes an URL or a link relative to a base url. URLs that point to the same resource will return the same string.'''
new = urlparse(urljoin(base, url).lower())
return urlunsplit((
new.scheme,
(new.port == None) and (new.hostname + ":80") or new.netloc,
new.path,
new.query,
''))
这篇关于规范化/规范化 URL?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!