How to determine the filename of content downloaded with HTTP in Python?


Question

I download a file using the get function of the Python requests library. To store the file, I'd like to determine the filename the way a web browser would for its 'Save' or 'Save as...' dialog.

Easy, right? I can just get it from the Content-Disposition HTTP header, accessible on the response object:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)

But looking more closely at this topic, it isn't that easy:

According to RFC 6266 section 4.3, and the grammar in section 4.1, the value can be an unquoted token (e.g. the_report.pdf) or a quoted string, which can also contain whitespace (e.g. "the report.pdf") and escape sequences. Further,

when both "filename" and "filename*" are present in a single header field value, [we] SHOULD pick "filename*" and ignore "filename".

The value of filename* is, moreover, somewhat more complicated than that of filename.

Also, the RFC seems to allow for additional whitespace around the =.

Thus, for the examples listed in the RFC, I'd want the following results:

  • Content-Disposition: Attachment; filename=example.html

    filename: example.html

  • Content-Disposition: INLINE; FILENAME= "an example.html"

    filename: an example.html

  • Content-Disposition: attachment;
                         filename*= UTF-8''%e2%82%ac%20rates

    filename: € rates

  • Content-Disposition: attachment;
                         filename="EURO rates";
                         filename*=utf-8''%e2%82%ac%20rates

    filename: € rates here, too (not EURO rates, as filename* takes precedence)
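To make the gap concrete, here is a small sketch that runs the naive regular expression from the snippet above against the four RFC examples. The names NAIVE, examples and results are only illustrative, and the folded headers are written on a single line for simplicity.

```python
import re

# The pattern from the naive snippet: case-sensitive, no whitespace
# tolerance, no knowledge of quoting or of filename*.
NAIVE = re.compile("filename=(.+)")

examples = [
    'Attachment; filename=example.html',
    'INLINE; FILENAME= "an example.html"',
    "attachment; filename*= UTF-8''%e2%82%ac%20rates",
    'attachment; filename="EURO rates"; filename*=utf-8\'\'%e2%82%ac%20rates',
]

results = [NAIVE.findall(header) for header in examples]

for header, found in zip(examples, results):
    print(repr(header), '->', found)
```

Only the first header yields the wanted name. The second and third produce an empty list (upper-case FILENAME and filename* do not match), and the fourth returns everything after the first filename=, quotes and the filename* parameter included.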

Now, I could easily adapt the regular expression to account for variable whitespace around the =, but having it handle all the other variations, too, would get rather unwieldy. (With the quoting and escaping, I'm not even sure regular expressions can cover all the cases. Maybe they can, as there is no brace-nesting involved.)

So do I have to implement a full-blown parser, or can I determine the filename according to RFC 6266 with a few calls to an HTTP library (maybe requests itself)? As RFC 6266 is part of the HTTP standard, I could imagine that some libraries specializing in HTTP already cover this. (So I've also asked on Software Recommendations SE.)

Recommended answer

The rfc6266 library appears to do exactly what you need. It can parse raw headers, requests responses, and urllib2 responses. It's on PyPI.

Some examples:

    >>> import rfc6266, requests
    >>> rfc6266.parse_headers('''Attachment; filename=example.html''').filename_unsafe
    'example.html'
    >>> rfc6266.parse_headers('''INLINE; FILENAME= "an example.html"''').filename_unsafe
    'an example.html'
    >>> rfc6266.parse_headers(
        '''attachment; '''
        '''filename*= UTF-8''%e2%82%ac%20rates''').filename_unsafe
    '€ rates'
    >>> rfc6266.parse_headers(
        '''attachment; '''
        '''filename="EURO rates"; '''
        '''filename*=utf-8''%e2%82%ac%20rates''').filename_unsafe
    '€ rates'
    >>> r = requests.get('http://example.com/€ rates')
    >>> rfc6266.parse_requests_response(r).filename_unsafe
    '€ rates'
    

As a note, though: this library does not like nonstandard whitespace in the header.
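If extra whitespace is a concern, or adding a dependency is not an option, a standard-library fallback might look roughly like the sketch below. It is not the rfc6266 library's implementation and does not cover the full grammar (for instance, RFC 2231 continuations are ignored). The names content_disposition_params and guess_filename are made up for this example.

```python
import re
from urllib.parse import unquote

# One ";name=value" parameter. The value is either a quoted-string or a
# bare token; whitespace after ";" and around "=" is tolerated.
_PARAM = re.compile(r';\s*([^\s=;]+)\s*=\s*("(?:[^"\\]|\\.)*"|[^\s;]*)')


def content_disposition_params(header_value):
    """Return the header's parameters as a dict with lower-cased names."""
    params = {}
    for name, value in _PARAM.findall(header_value):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            # Strip the surrounding quotes and undo backslash escapes.
            value = re.sub(r'\\(.)', r'\1', value[1:-1])
        # Keep the first occurrence if a parameter is repeated.
        params.setdefault(name.lower(), value)
    return params


def guess_filename(header_value):
    """Prefer filename* over filename, as RFC 6266 section 4.3 recommends."""
    params = content_disposition_params(header_value)
    extended = params.get('filename*')
    if extended is not None and extended.count("'") >= 2:
        # ext-value layout: charset'language'percent-encoded-text
        charset, _language, encoded = extended.split("'", 2)
        try:
            return unquote(encoded, encoding=charset, errors='strict')
        except (LookupError, UnicodeDecodeError):
            pass  # unknown charset or undecodable bytes: use filename
    return params.get('filename')


print(guess_filename('INLINE; FILENAME= "an example.html"'))
print(guess_filename(
    'attachment; filename="EURO rates"; filename*=utf-8\'\'%e2%82%ac%20rates'))
```

With requests, the header value would come from r.headers.get('content-disposition'). As the library's filename_unsafe attribute suggests, the result is untrusted input and should be sanitized before being used as a local path.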
