Python urlparse: small issue


Question

I'm making an app that parses HTML and extracts images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images works with urllib2.

I do have a problem using urlparse to turn relative paths into absolute ones. The problem is best explained with an example:

>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'

As you can see, urlparse doesn't strip the ../ from the result. This causes a problem when I try to download the image:

HTTPError: HTTP Error 400: Bad Request

Is there a way to fix this problem in urllib?

Recommended answer

I think the best you can do is to pre-parse the original URL and check its path component. A path of length 0 or 1 (empty or just "/") means the base URL is the site root, which has no parent directory. A simple test is

if len(urlparse.urlparse(baseurl).path) > 1:

Then you can combine it with the indexing suggested by demas. For example:

baseurl = "http://www.example.com/"
# Drop the leading ".." only when the base URL is the site root.
start_offset = 2 if len(urlparse.urlparse(baseurl).path) <= 1 else 0
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])

This way, you will not attempt to go to the parent of the root URL.
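
The slicing above only handles a single leading "../". A more general variant of the same idea can be sketched by joining first and then dropping any ".." segment that would climb above the root. The helper name join_clamped is hypothetical, and the sketch assumes Python 3, where the urlparse module lives in urllib.parse; it is one possible approach, not the method from the answer above.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_clamped(base_url, relative_url):
    """Join two URLs, ignoring ".." segments that would go above the root.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(urljoin(base_url, relative_url))

    segments = []
    for segment in parts.path.split("/"):
        if segment == "..":
            # Step up one level, but never past the root.
            if segments:
                segments.pop()
        elif segment and segment != ".":
            segments.append(segment)

    path = "/" + "/".join(segments)
    # Keep a trailing slash if the joined path had one.
    if parts.path.endswith("/") and path != "/":
        path += "/"

    return urlunsplit(parts._replace(path=path))


if __name__ == "__main__":
    print(join_clamped("http://www.example.com/", "../test.png"))
    print(join_clamped("http://www.example.com/a/b/page.html", "../img/test.png"))
```

For the URL from the question this yields http://www.example.com/test.png, so the request no longer contains a /../ segment.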
