URL parsing in Python - normalizing double-slash in paths

Views: 482 Published: 2020/7/13 2:09:30 Tags: python urlparse

Problem description

I am working on an app that needs to parse URLs (mostly HTTP URLs) in HTML pages. I have no control over the input, and some of it is, as expected, a bit messy.

One problem I encounter frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double slashes in the path part, for example:

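from urllib import parse as urlparse  # Python 3; on Python 2 use: import urlparse
testUrl = 'http://www.example.com//path?foo=bar'
# Join the URL with its own path in an attempt to drop the query string
urlparse.urljoin(testUrl,
                 urlparse.urlparse(testUrl).path)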

Instead of the expected result http://www.example.com//path (or even better, with a normalized single slash), I end up with http://path.
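
The result is less surprising once the second argument is looked at on its own. Under RFC 3986 a reference that begins with `//` is a network-path reference, so the text after the two slashes is read as a host name, not as a path. A minimal sketch of that behaviour, assuming Python 3 (where the old urlparse module lives in urllib.parse); the variable names are only illustrative:

```python
from urllib.parse import urljoin, urlsplit

test_url = 'http://www.example.com//path?foo=bar'
path = urlsplit(test_url).path  # the path component keeps both slashes

# Parsed on its own, the same string has 'path' as its host and an empty path.
reference = urlsplit(path)
print(repr(reference.netloc), repr(reference.path))

# urljoin keeps the scheme of the base URL and takes the host from the reference.
joined = urljoin(test_url, path)
print(joined)
```

So urljoin is following the relative-reference rules here; the ambiguity comes from feeding it a bare path that happens to start with two slashes.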

BTW, the reason I'm running such code is that it's the only way I have found so far to strip the query / fragment part off URLs. Maybe there is a better way to do it, but I couldn't find one.
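
One possible alternative for stripping the query and fragment, sketched under the assumption of Python 3's urllib.parse, is to split the URL into components and reassemble it with those two components left empty. This avoids urljoin entirely, so the double slash in the path is never reinterpreted. The helper name strip_query_and_fragment is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return url without its '?query' and '#fragment' parts."""
    parts = urlsplit(url)
    # Rebuild from scheme, host and path only; query and fragment are left empty.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


print(strip_query_and_fragment('http://www.example.com//path?foo=bar#top'))
```

The path is passed through unchanged, so any normalization of the double slash would still be a separate step.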

Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?
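
For reference, the regex route mentioned above might look roughly like the sketch below. It assumes Python 3 and applies the substitution to the path component only, so that the `//` after the scheme and any slashes inside the query are left alone. The helper name collapse_slashes is hypothetical, and the sketch assumes the target servers treat repeated slashes as equivalent to a single one, which is not guaranteed:

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url):
    """Collapse runs of '/' in the path component of url into a single '/'."""
    parts = urlsplit(url)
    # Only the path is rewritten; scheme, host, query and fragment are kept as-is.
    path = re.sub(r'/{2,}', '/', parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(collapse_slashes('http://www.example.com//path?foo=bar'))
```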

Recommended answer

If you only want to get the URL without the query part, I would skip the urlparse module and just do:

testUrl.split('?', 1)

The URL will be at index 0 of the returned list and the query, if there is one, at index 1.

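A URL can contain more than one '?': RFC 3986 allows a literal '?' inside the query and fragment. Only the first '?' starts the query, so index 0 never contains the query, but the query itself stays in one piece only if the split is limited to a single cut (maxsplit=1). Note that a fragment on a URL with no query is not removed by this approach.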
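
The string-splitting idea can be made a little more defensive. The sketch below is one way to do that, not the answerer's code: it uses str.partition, which always returns three items, so a URL without a '?' does not raise an IndexError, and it cuts the fragment first because the question asked for that to be removed as well. The name split_query is hypothetical:

```python
def split_query(url):
    """Return (url_without_query_or_fragment, query) using plain string methods."""
    # Drop the fragment first: '#' ends the query part of a URL.
    without_fragment = url.partition('#')[0]
    # Cut at the first '?' only; later '?' characters stay inside the query.
    base, _, query = without_fragment.partition('?')
    return base, query


print(split_query('http://www.example.com//path?foo=bar'))
```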
