Python urlparse: small issue
Problem description
I'm making an app that parses HTML and gets images from it. Parsing is easy with Beautiful Soup, and downloading the HTML and the images works fine with urllib2.
I do have a problem using urlparse to turn relative paths into absolute ones. The problem is best explained with an example:
>>> import urlparse
>>> urlparse.urljoin("http://www.example.com/", "../test.png")
'http://www.example.com/../test.png'
As you can see, urlparse doesn't strip the leading ../ from the result. This causes a problem when I try to download the image:
HTTPError: HTTP Error 400: Bad Request
Is there a way to fix this problem in urllib?
Recommended answer
I think the best you can do is to pre-parse the original URL and check its path component. A simple test for whether the base URL has a path below the root is:
if len(urlparse.urlparse(baseurl).path) > 1:
Then you can combine it with the indexing suggested by demas. For example:
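baseurl = "http://www.example.com/"
# If the base URL is the site root (path "" or "/"), skip the leading ".." of the relative path.
start_offset = 2 if len(urlparse.urlparse(baseurl).path) <= 1 else 0
img_url = urlparse.urljoin(baseurl, "../test.png"[start_offset:])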
This way, you will not attempt to go to the parent of the root URL.
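As a more general alternative to slicing the string, here is a minimal sketch that joins first and then drops any ".." segments that would climb above the root. It is written for Python 3, where the urlparse module lives in urllib.parse; the helper names clamp_to_root and join_and_clamp are made up for this illustration and are not part of any library. Note that on Python 3.5 and later, urljoin follows RFC 3986 and already discards the extra ".." for this example, so the clamping step mostly matters for older interpreters or for URLs that were built some other way.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def clamp_to_root(url):
    """Resolve "." and ".." in the path, never climbing above the root.

    Hypothetical helper for illustration. Empty segments are dropped,
    so doubled slashes in the path are collapsed as a side effect.
    """
    parts = urlsplit(url)
    kept = []
    for segment in parts.path.split("/"):
        if segment == "..":
            # Go up one level if possible; at the root, ignore the "..".
            if kept:
                kept.pop()
        elif segment and segment != ".":
            kept.append(segment)
    path = "/" + "/".join(kept)
    # Keep a trailing slash if the original path had one.
    if kept and parts.path.endswith("/"):
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def join_and_clamp(base_url, relative_url):
    """Join a relative URL onto a base URL, then clamp the result to the root."""
    return clamp_to_root(urljoin(base_url, relative_url))


if __name__ == "__main__":
    print(join_and_clamp("http://www.example.com/", "../test.png"))
    print(join_and_clamp("http://www.example.com/a/b/", "../test.png"))
```

This avoids hard-coding a string offset, so it also copes with relative paths that contain more than one "../" prefix.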