URL parsing in Python - normalizing double-slash in paths
Question
I am working on an app that needs to parse URLs (mostly HTTP URLs) found in HTML pages. I have no control over the input, and some of it is, as expected, a bit messy.
One problem I encounter frequently is that urlparse is very strict (and possibly even buggy?) when parsing and joining URLs that have double slashes in the path part, for example:
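import urlparse  # Python 2 module; in Python 3 these functions live in urllib.parse

testUrl = 'http://www.example.com//path?foo=bar'
# The path component here is '//path', which urljoin then treats as a host reference
urlparse.urljoin(testUrl, urlparse.urlparse(testUrl).path)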
Instead of the expected result http://www.example.com//path (or, even better, one with a normalized single slash), I end up with http://path.
BTW, the reason I'm running such code is that it's the only way I have found so far to strip the query / fragment part off URLs. Maybe there is a better way to do it, but I couldn't find one.
Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?
Recommended answer
If you only want to get the URL without the query part, I would skip the urlparse module and just do:
testUrl.split('?', 1)
The URL will be at index 0 of the returned list and the query, if there is one, at index 1.
Splitting on the first '?' only matters because a literal '?' may appear again inside the query string, so the first one is always the separator. Note that this leaves a fragment in place when the URL has a '#...' part but no '?'.
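If you would rather stay with the standard library's parser, here is a minimal sketch of another way to drop the query and fragment, assuming Python 3's urllib.parse (in Python 2 the same urlsplit and urlunsplit functions live in the urlparse module). The helper name strip_query_and_fragment and its collapse_slashes flag are made up for illustration. It rebuilds the URL from its components instead of passing the path back through urljoin, because a reference that starts with '//' is read as a host rather than a path, which is where http://path comes from.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url, collapse_slashes=True):
    """Return url without its query and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path
    if collapse_slashes:
        # Collapse runs of "/" inside the path only; the "//" that follows
        # the scheme is not part of parts.path, so it is left untouched.
        path = re.sub(r"/{2,}", "/", path)
    # Reassemble from components so a path beginning with "//" is never
    # re-parsed as a network location.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(strip_query_and_fragment("http://www.example.com//path?foo=bar#top"))
# http://www.example.com/path
```

Passing collapse_slashes=False keeps the path exactly as it was written, which may be the safer choice if a server treats //path and /path as different resources.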