Python:如何解析包含“.."的 URL [英] Python: How to resolve URLs containing '..'
问题描述
我需要唯一标识和存储一些 URL.问题是有时它们包含..",例如 http://somedomain.com/foo/bar/../../some/url
基本上是 http:///somedomain.com/some/url
如果我没记错的话.
是否有 Python 函数或解决此 URL 的棘手方法?
但是,如果没有尾部斜杠(最后一个组件是文件,而不是目录),最后一个组件将被删除.
此修复使用 urlparse 函数提取路径,然后使用(posixpath 版本)os.path
规范化组件.补偿尾随斜杠的神秘问题,然后将 URL 重新连接在一起.以下是 doctest
able:
from urllib.parse import urlparse导入posixpathdef resolve_components(url):""">>>resolve_components('http://www.example.com/foo/bar/../../baz/bux/')'http://www.example.com/baz/bux/'>>>resolve_components('http://www.example.com/some/path/../file.ext')'http://www.example.com/some/file.ext'"""解析 = urlparse(url)new_path = posixpath.normpath(parsed.path)如果 parsed.path.endswith('/'):# 补偿 issue1707768新路径 += '/'清洁 = 已解析._replace(path=new_path)返回cleaned.geturl()
I need to uniquely identify and store some URLs. The problem is that sometimes they come containing ".." like http://somedomain.com/foo/bar/../../some/url
which basically is http://somedomain.com/some/url
if I'm not wrong.
Is there a Python function or a tricky way to resolve this URLs ?
There’s a simple solution using urllib.parse.urljoin
:
>>> from urllib.parse import urljoin
>>> urljoin('http://www.example.com/foo/bar/../../baz/bux/', '.')
'http://www.example.com/baz/bux/'
However, if there is no trailing slash (the last component is a file, not a directory), the last component will be removed.
This fix uses the urlparse function to extract the path, then use (the posixpath version of) os.path
to normalize the components. Compensate for a mysterious issue with trailing slashes, then join the URL back together. The following is doctest
able:
from urllib.parse import urlparse
import posixpath
def resolve_components(url):
"""
>>> resolve_components('http://www.example.com/foo/bar/../../baz/bux/')
'http://www.example.com/baz/bux/'
>>> resolve_components('http://www.example.com/some/path/../file.ext')
'http://www.example.com/some/file.ext'
"""
parsed = urlparse(url)
new_path = posixpath.normpath(parsed.path)
if parsed.path.endswith('/'):
# Compensate for issue1707768
new_path += '/'
cleaned = parsed._replace(path=new_path)
return cleaned.geturl()
这篇关于Python:如何解析包含“.."的 URL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!