How to determine the filename of content downloaded with HTTP in Python?
Question
I download a file using the get function of the Python requests library. To store the file, I'd like to determine the filename the way a web browser would for its 'save' or 'save as ...' dialog.
Easy, right? I can just get it from the Content-Disposition HTTP header, accessible on the response object:
import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)
But looking more closely at this topic, it isn't that easy:
According to RFC 6266 section 4.3, and the grammar in section 4.1, the value can be an unquoted token (e.g. the_report.pdf) or a quoted string that can also contain whitespace (e.g. "the report.pdf") and escape sequences. Further,
when both "filename" and "filename*" are present in a single header field value, [we] SHOULD pick "filename*" and ignore "filename".
The value of filename*, in turn, is somewhat more complicated than that of filename.
Also, the RFC seems to allow for additional whitespace around the =.
Thus, for the examples listed in the RFC, I'd want the following results:
Content-Disposition: Attachment; filename=example.html
filename: example.html
Content-Disposition: INLINE; FILENAME= "an example.html"
filename: an example.html
Content-Disposition: attachment;
    filename*= UTF-8''%e2%82%ac%20rates
filename: € rates
Content-Disposition: attachment;
    filename="EURO rates";
    filename*=utf-8''%e2%82%ac%20rates
filename: € rates
here, too (not EURO rates, as filename* takes precedence)
Now, I could easily adapt the regular expression to account for variable whitespace around the =, but having it handle all the other variations, too, would get rather unwieldy. (With the quoting and escaping, I'm not even sure regular expressions can cover all the cases. Maybe they can, as there is no brace-nesting involved.)
So do I have to implement a full-blown parser, or can I determine the filename according to RFC 6266 with a few calls to an HTTP library (maybe requests itself)? As RFC 6266 is part of the HTTP standard, I could imagine that some libraries specialized in HTTP already cover this. (So I've also asked on Software Recommendations SE.)
Recommended answer
The rfc6266 library appears to do exactly what you need. It can parse raw headers, requests responses, and urllib2 responses. It's on PyPI.
Some examples:
>>> import rfc6266, requests
>>> rfc6266.parse_headers('''Attachment; filename=example.html''').filename_unsafe
'example.html'
>>> rfc6266.parse_headers('''INLINE; FILENAME= "an example.html"''').filename_unsafe
'an example.html'
>>> rfc6266.parse_headers(
...     '''attachment; '''
...     '''filename*= UTF-8''%e2%82%ac%20rates''').filename_unsafe
'€ rates'
>>> rfc6266.parse_headers(
...     '''attachment; '''
...     '''filename="EURO rates"; '''
...     '''filename*=utf-8''%e2%82%ac%20rates''').filename_unsafe
'€ rates'
>>> r = requests.get('http://example.com/€ rates')
>>> rfc6266.parse_requests_response(r).filename_unsafe
'€ rates'
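The attribute name filename_unsafe suggests that the value comes straight from the server and should not be used as a local path without checking. Below is a minimal sketch of one way to reduce such a value to a bare file name before saving. The helper name safe_filename and the fallback name "download" are assumptions made for illustration; they are not part of rfc6266 or requests.

```python
import re


def safe_filename(unsafe_name, default="download"):
    """Reduce a server-supplied name to a bare file name (hypothetical helper)."""
    if not unsafe_name:
        return default
    # Treat both "/" and "\" as separators, whatever the local platform is,
    # and keep only the last path component.
    name = unsafe_name.replace("\\", "/").rsplit("/", 1)[-1]
    # Drop ASCII control characters and surrounding whitespace.
    name = re.sub(r"[\x00-\x1f\x7f]", "", name).strip()
    # Refuse names that are empty or refer to a directory.
    if name in ("", ".", ".."):
        return default
    return name
```

This only guards against path components and control characters. Depending on the target file system, further checks (reserved names, length limits, existing files) may be needed.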
As a note, though: this library does not like nonstandard whitespace in the header.
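If the library is not an option, or its strictness about whitespace gets in the way, a small hand-rolled fallback can cover the four examples from the question. The following is a sketch using only the standard library, not a complete RFC 6266 implementation. The function name filename_from_content_disposition is an assumption made for illustration, and the sketch does not validate the disposition type or handle every corner of the grammar.

```python
import re
from urllib.parse import unquote

# One parameter: ";", a name, optional whitespace around "=",
# then either a quoted-string (with backslash escapes) or a bare token.
_PARAM = re.compile(
    r';\s*(?P<name>[^\s=;]+)\s*=\s*'
    r'(?:"(?P<quoted>(?:[^"\\]|\\.)*)"|(?P<token>[^\s;]+))'
)


def filename_from_content_disposition(value):
    """Return the filename from a Content-Disposition value, or None.

    Hypothetical helper: prefers filename* over filename, as RFC 6266
    section 4.3 recommends.
    """
    params = {}
    for match in _PARAM.finditer(value):
        name = match.group('name').lower()  # parameter names are case-insensitive
        if match.group('quoted') is not None:
            # Undo backslash escapes inside the quoted-string.
            val = re.sub(r'\\(.)', r'\1', match.group('quoted'))
        else:
            val = match.group('token')
        params.setdefault(name, val)  # keep the first occurrence of each name

    if 'filename*' in params:
        # Extended value format: charset'language'percent-encoded-text
        charset, _, rest = params['filename*'].partition("'")
        _language, _, encoded = rest.partition("'")
        try:
            return unquote(encoded, encoding=charset, errors='strict')
        except (LookupError, UnicodeDecodeError):
            pass  # unknown charset or bad bytes: fall back to plain filename
    return params.get('filename')
```

For a requests response r, this could be called as filename_from_content_disposition(r.headers.get('content-disposition', '')). The result is still untrusted input and should be sanitized before being used as a path.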