Python: How to resolve URLs containing '..'

Problem description

I need to uniquely identify and store some URLs. The problem is that they sometimes contain "..", as in http://somedomain.com/foo/bar/../../some/url, which is basically http://somedomain.com/some/url if I'm not wrong.

Is there a Python function or a tricky way to resolve these URLs?

Solution

There’s a simple solution using urllib.parse.urljoin:

>>> from urllib.parse import urljoin
>>> urljoin('http://www.example.com/foo/bar/../../baz/bux/', '.')
'http://www.example.com/baz/bux/'

However, if there is no trailing slash (the last component is a file, not a directory), the last component will be removed.
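To make the caveat concrete, here is a small sketch comparing the two cases. The variable names are only illustrative, and the results shown are what a recent Python 3 urljoin should return:

```python
from urllib.parse import urljoin

# Trailing slash: the last component is treated as a directory and kept.
with_slash = urljoin('http://www.example.com/foo/bar/../../baz/bux/', '.')

# No trailing slash: the last component is treated as a file and dropped.
without_slash = urljoin('http://www.example.com/foo/bar/../../baz/bux', '.')

print(with_slash)     # http://www.example.com/baz/bux/
print(without_slash)  # http://www.example.com/baz/
```

So the urljoin trick is only safe when every URL is known to end in a slash.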

This fix uses the urlparse function to extract the path, then uses posixpath (the POSIX version of os.path) to normalize the components. It compensates for normpath stripping the trailing slash, then joins the URL back together. The following is doctestable:

from urllib.parse import urlparse
import posixpath

def resolve_components(url):
    """
    >>> resolve_components('http://www.example.com/foo/bar/../../baz/bux/')
    'http://www.example.com/baz/bux/'
    >>> resolve_components('http://www.example.com/some/path/../file.ext')
    'http://www.example.com/some/file.ext'
    """
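    parsed = urlparse(url)
    if not parsed.path:
        # Nothing to normalize; normpath('') would return '.'
        return parsed.geturl()
    new_path = posixpath.normpath(parsed.path)
    if parsed.path.endswith('/') and not new_path.endswith('/'):
        # Compensate for issue1707768 (normpath strips the trailing slash);
        # the second check avoids turning a bare '/' path into '//'
        new_path += '/'
    cleaned = parsed._replace(path=new_path)
    return cleaned.geturl()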
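
Since the original goal was to identify and store URLs uniquely, one possible use of this approach is to normalize each URL before putting it in a set. The sketch below is only an illustration: normalize_url is a hypothetical helper that restates the same idea so the block runs on its own, and the third URL with its query string is made up for the example.

```python
from urllib.parse import urlparse
import posixpath

def normalize_url(url):
    # Hypothetical helper: same technique as resolve_components above.
    parsed = urlparse(url)
    if not parsed.path:
        # An empty path has nothing to normalize.
        return parsed.geturl()
    path = posixpath.normpath(parsed.path)
    if parsed.path.endswith('/') and not path.endswith('/'):
        # normpath drops the trailing slash, so put it back.
        path += '/'
    return parsed._replace(path=path).geturl()

urls = [
    'http://somedomain.com/foo/bar/../../some/url',
    'http://somedomain.com/some/url',
    'http://somedomain.com/some/./url?page=2',  # made-up example with a query
]

# A set keeps one entry per normalized URL; sorted() just makes the order stable.
unique = sorted({normalize_url(u) for u in urls})
print(unique)
# ['http://somedomain.com/some/url', 'http://somedomain.com/some/url?page=2']
```

Only the path is normalized here. The query string and fragment pass through unchanged, so URLs that differ only in their query are still treated as distinct.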
