Identify the file extension of a URL

Question

I am looking to extract the file extension if it exists for web addresses (trying to identify which links are to a list of extensions which I do not want e.g. .jpg, .exe etc).

So, I would want to extract from the following URL www.example.com/image.jpg the extension jpg, and also handle cases when there is no extension such as www.example.com/file (i.e. return nothing).

I can't think how to implement it, but one way I thought of was to get everything after the last dot, which if there was an extension would allow me to look that extension up, and if there wasn't, for the example www.example.com/file it would return com/file (which given is not in my list of excluded file-extensions, is fine).

There may be an alternative superior way using a package I am not aware of, which could identify what is/isn't an actual extension. (i.e. cope with cases when the URL does not actually have an extension).

Solution

The urlparse module (urllib.parse in Python 3) provides tools for working with URLs. Although it doesn't provide a way to extract the file extension from a URL, it's possible to do so by combining it with os.path.splitext:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
from os.path import splitext

def get_ext(url):
    """Return the filename extension from url, or ''."""
    parsed = urlparse(url)
    root, ext = splitext(parsed.path)
    return ext  # or ext[1:] if you don't want the leading '.'

Example usage:

>>> get_ext("www.example.com/image.jpg")
'.jpg'
>>> get_ext("https://www.example.com/page.html?foo=1&bar=2#fragment")
'.html'
>>> get_ext("https://www.example.com/resource")
''
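
To tie this back to the original goal of skipping links with unwanted extensions, here is a minimal sketch of a blocklist filter built on the same idea. The names BLOCKED_EXTENSIONS and is_blocked, and the sample links, are assumptions made for illustration and are not part of the answer above. It uses posixpath.splitext so that "/" is treated as the path separator on every platform, and lowercases the extension so that ".EXE" matches ".exe".

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical blocklist; the entries are illustrative only.
BLOCKED_EXTENSIONS = {".jpg", ".exe"}


def get_ext(url):
    """Return the lowercased extension of the URL's path, or ''."""
    path = urlparse(url).path
    return splitext(path)[1].lower()


def is_blocked(url, blocked=BLOCKED_EXTENSIONS):
    """Return True if the URL's extension is in the blocklist."""
    return get_ext(url) in blocked


# Sample links, made up for this example.
links = [
    "www.example.com/image.jpg",
    "www.example.com/file",
    "https://www.example.com/setup.EXE?download=1",
    "https://www.example.com/page.html",
]

# Keep only the links whose extension is not blocked.
wanted = [link for link in links if not is_blocked(link)]
print(wanted)
```

One caveat: without a scheme, urlparse treats the whole string as a path, so a bare host such as "www.example.com" yields '.com'. That is harmless for a blocklist that does not contain '.com', but it is worth keeping in mind if the extension is used for anything else.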
