URL parsing in Python - normalizing double-slash in paths

Question

I am working on an app that needs to parse URLs (mostly HTTP URLs) found in HTML pages. I have no control over the input, and some of it is, as expected, a bit messy.

One problem I encounter frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double slashes in the path part, for example:

import urlparse  # Python 2; in Python 3 these functions live in urllib.parse

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl,
                 urlparse.urlparse(testUrl).path)

Instead of the expected result http://www.example.com//path (or even better, the same URL with a normalized single slash), I end up with http://path.
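The behaviour can be reproduced with Python 3's urllib.parse, the successor of the Python 2 urlparse module (using Python 3 here is an assumption of this sketch, and test_url is just the example URL from the question). The path component comes back as //path, and urljoin reads a reference that starts with // as a network-path reference, so "path" ends up as the host:

```python
from urllib.parse import urljoin, urlsplit

test_url = 'http://www.example.com//path?foo=bar'

# The path component keeps its leading double slash.
path = urlsplit(test_url).path
print(path)

# A reference starting with '//' is interpreted as '//host',
# so urljoin keeps the base scheme and replaces the host.
joined = urljoin(test_url, path)
print(joined)
```

This suggests the result follows from how relative references are resolved rather than from a parsing bug.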

By the way, the reason I'm running such code is that it's the only way I have found so far to strip the query / fragment part off URLs. Maybe there is a better way to do it, but I couldn't find one.
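One possible way to drop the query and fragment without going through urljoin is to split the URL and rebuild it from the remaining components. The sketch below assumes Python 3's urllib.parse; strip_query_and_fragment is a hypothetical helper name, not part of the library:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query_and_fragment(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Rebuild from scheme, host and path only; query and fragment stay empty.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

stripped = strip_query_and_fragment('http://www.example.com//path?foo=bar#top')
print(stripped)
```

Because the host is passed back in explicitly, the double slash in the path is kept as part of the path and is not mistaken for a host.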

Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?
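The regex normalization mentioned here could look roughly like the following. It is a minimal sketch assuming Python 3's urllib.parse; collapse_slashes is a hypothetical helper name, and it touches only the path so that the "//" after the scheme and any slashes in the query are left alone:

```python
import re
from urllib.parse import urlsplit, urlunsplit

def collapse_slashes(url):
    """Collapse runs of '/' in the path component only (hypothetical helper)."""
    parts = urlsplit(url)
    path = re.sub(r'/{2,}', '/', parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

normalized = collapse_slashes('http://www.example.com//path?foo=bar')
print(normalized)
```

Some servers treat repeated slashes as significant, so whether collapsing them is safe depends on the sites being crawled.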

Recommended answer

If you only want to get the URL without the query part, I would skip the urlparse module and just do:

testUrl.rsplit('?')

The URL will be at index 0 of the returned list and the query at index 1.

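A URL normally contains a single '?' that separates the query, although RFC 3986 also permits a literal '?' inside the query or fragment. Either way, index 0 holds everything before the first '?', so this works for any URL that has a query. A fragment on a URL without a query (for example http://www.example.com/path#top) is not removed by this split.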
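The string-splitting idea can be wrapped in a small function. The version below is a sketch rather than the answer's exact code: strip_with_split is a hypothetical name, and it additionally cuts at the first '#' so that fragments are removed too, which the question also asked for:

```python
def strip_with_split(url):
    """Drop the fragment and the query using plain string splitting (hypothetical helper)."""
    # Cut at the first '#', then at the first '?'; whatever precedes both is kept.
    return url.split('#', 1)[0].split('?', 1)[0]

print(strip_with_split('http://www.example.com//path?foo=bar'))
print(strip_with_split('http://www.example.com/path#top'))
```

Limiting each split to one cut means a later '?' or '#' inside the query or fragment does not affect the result.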
