Reconstructing absolute URLs from relative URLs on a page
Question
Given the absolute URL of a page and a relative link found within that page, is there a way to a) definitively reconstruct or b) best-effort reconstruct the absolute URL of the relative link?
In my case, I'm reading an HTML file from a given URL using Beautiful Soup, extracting the sources of all the img tags, and trying to construct a list of absolute URLs for the page's images.
My Python function so far looks like:
def get_image_url(page_url, image_src):
    from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
    # parsed = urlparse('http://user:pass@NetLoc:80/path;parameters?query=argument#fragment')
    parsed = urlparse(page_url)
    url_base = parsed.netloc
    url_path = parsed.path
    if image_src.find('http') == 0:
        # It's an absolute URL, do nothing.
        pass
    elif image_src.find('/') == 0:
        # If it's a root URL, append it to the base URL:
        image_src = 'http://' + url_base + image_src
    else:
        # If it's a relative URL, ?
        pass
    return image_src
NOTE: I don't need a Python answer, just the logic required.
Recommended answer
Very simple, use urljoin:
>>> from urllib.parse import urljoin  # Python 2: from urlparse import urljoin
>>> urljoin('http://mysite.com/foo/bar/x.html', '../../images/img.png')
'http://mysite.com/images/img.png'
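To tie this back to the image-list use case in the question, here is a minimal sketch under a few assumptions. The names ImgSrcCollector and absolute_image_urls are hypothetical, and the standard-library html.parser stands in for Beautiful Soup so that the example has no third-party dependencies. urljoin handles all three cases from the question (already absolute, root-relative, and path-relative), so no branching is needed. This remains best-effort: a page may declare a different base URL with a base tag, which this sketch ignores.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the raw src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; skip images without a src.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def absolute_image_urls(page_url, html):
    """Resolve every img src found in html against page_url."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [urljoin(page_url, src) for src in collector.sources]


if __name__ == "__main__":
    sample_html = (
        '<img src="http://cdn.example.com/a.png">'  # already absolute
        '<img src="/img/b.png">'                    # root-relative
        '<img src="c.png">'                         # relative to the page's directory
        '<img src="../d.png">'                      # relative, one level up
    )
    for url in absolute_image_urls("http://mysite.com/foo/bar/x.html", sample_html):
        print(url)
```

With Beautiful Soup, the same idea would apply: pass each img tag's src value to urljoin together with the page URL.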