规范化/规范化 URL? [英] Canonicalize / normalize a URL?

查看:89
本文介绍了规范化/规范化 URL?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找一个库函数来规范 Python 中的 URL,即删除路径中的./"或../"部分,或添加默认端口或转义特殊字符等.结果应该是一个字符串,该字符串对于指向同一网页的两个 URL 来说是唯一的.例如 http://google.comhttp://google.com:80/a/../ 应返回相同的结果.

I am searching for a library function to normalize a URL in Python, that is to remove "./" or "../" parts in the path, or add a default port or escape special characters and so on. The result should be a string that is unique for two URLs pointing to the same web page. For example http://google.com and http://google.com:80/a/../ shall return the same result.

我更喜欢 Python 3 并且已经浏览了 urllib 模块.它提供了拆分 URL 的功能,但没有将它们规范化.Java 有 URI.normalize() 函数可以做类似的事情(虽然它不认为默认端口 80 等于没有给定的端口),但是有没有类似 python 的东西?

I would prefer Python 3 and already looked through the urllib module. It offers functions to split URLs but nothing to canonicalize them. Java has the URI.normalize() function that does a similar thing (though it does not consider the default port 80 equal to no given port), but is there something like this is python?

推荐答案

遵循良好的开端,我编写了一个方法适合网络中常见的大多数情况.

Following the good start, I composed a method that fits most of the cases commonly found in the web.

def urlnorm(base, link=''):
  '''Normalizes an URL or a link relative to a base url. URLs that point to the same resource will return the same string.'''
  new = urlparse(urljoin(base, url).lower())
  return urlunsplit((
    new.scheme,
    (new.port == None) and (new.hostname + ":80") or new.netloc,
    new.path,
    new.query,
    ''))

这篇关于规范化/规范化 URL?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆