Why aren't POST names with Unicode sent correctly when using multipart/form-data?


Question

I want to send a POST request with a file attached, but some of the field names have Unicode characters in them. The server doesn't receive them correctly, as seen below:

>>> # normal, without unicode
>>> resp = requests.post('http://httpbin.org/post', data={'snowman': 'hello'}, files={('kitten.jpg', open('kitten.jpg', 'rb'))}).json()['form']
>>> resp
{u'snowman': u'hello'}
>>>
>>> # with unicode, see that the name has become 'null'
>>> resp = requests.post('http://httpbin.org/post', data={'☃': 'hello'}, files={('kitten.jpg', open('kitten.jpg', 'rb'))}).json()['form']
>>> resp
{u'null': u'hello'}
>>>
>>> # it works without the image
>>> resp = requests.post('http://httpbin.org/post', data={'☃': 'hello'}).json()['form']
>>> resp
{u'\u2603': u'hello'}

How do I get around this problem?

Solution

From the Wireshark comments, it looks like python-requests is doing it wrong, although there might not be a "right answer".

RFC 2388 says

Field names originally in non-ASCII character sets may be encoded within the value of the "name" parameter using the standard method described in RFC 2047.

RFC 2047, in turn, says

Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between. It specifies a character set and an encoding method, and also includes the original text encoded as graphic ASCII characters, according to the rules for that encoding method.

and goes on to describe "Q" and "B" encoding methods. Using the "Q" (quoted-printable) method, the name would be:

=?utf-8?q?=E2=98=83?=
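
As a rough sketch of how that encoded-word comes about, the snippet below builds it by hand and then decodes it with the standard library. The helper name q_encode is made up for illustration, and it conservatively escapes everything except ASCII letters and digits, which is stricter than RFC 2047 requires.

```python
from email.header import decode_header


def q_encode(text, charset="utf-8"):
    """Hypothetical helper: wrap text in an RFC 2047 "Q" encoded-word."""
    raw = text.encode(charset)
    # Keep ASCII letters and digits; write every other octet as =XX.
    body = "".join(
        chr(b) if b < 128 and chr(b).isalnum() else "=%02X" % b
        for b in raw
    )
    return "=?%s?q?%s?=" % (charset, body)


encoded = q_encode("☃")
print(encoded)

# The standard library can turn the encoded-word back into text.
raw, charset = decode_header(encoded)[0]
print(raw.decode(charset))
```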

But, as RFC 6266 clearly states (quoting Section 5 of RFC 2047):

An 'encoded-word' MUST NOT be used in parameter of a MIME Content-Type or Content-Disposition field, or in any structured field body except within a 'comment' or 'phrase'.

so we're not allowed to do that. (Kudos to @Lukasa for this catch!)

RFC 2388 also says

The original local file name may be supplied as well, either as a "filename" parameter either of the "content-disposition: form-data" header or, in the case of multiple files, in a "content-disposition: file" header of the subpart. The sending application MAY supply a file name; if the file name of the sender's operating system is not in US-ASCII, the file name might be approximated, or encoded using the method of RFC 2231.

And RFC 2231 describes a method that looks more like what you're seeing. In it,

Asterisks ("*") are reused to provide the indicator that language and character set information is present and encoding is being used. A single quote ("'") is used to delimit the character set and language information at the beginning of the parameter value. Percent signs ("%") are used as the encoding flag, which agrees with RFC 2047.

Specifically, an asterisk at the end of a parameter name acts as an indicator that character set and language information may appear at the beginning of the parameter value. A single quote is used to separate the character set, language, and actual value information in the parameter value string, and an percent sign is used to flag octets encoded in hexadecimal.

That is, if this method is employed (and supported on both ends), the name should be:

name*=utf-8''%E2%98%83
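
A minimal sketch of that encoding, assuming UTF-8 and an empty language tag. The helper names ext_value and parse_ext_value are hypothetical, and the encoder percent-encodes more characters than strictly necessary, which should still be a valid extended value.

```python
from urllib.parse import quote, unquote


def ext_value(text, charset="utf-8", language=""):
    """Hypothetical helper: build an RFC 2231-style extended value."""
    # Percent-encode every octet except ASCII letters, digits and "_.-~".
    encoded = quote(text, safe="", encoding=charset)
    return "%s'%s'%s" % (charset, language, encoded)


def parse_ext_value(value):
    """Hypothetical helper: split charset'language'value and decode it."""
    charset, _language, encoded = value.split("'", 2)
    return unquote(encoded, encoding=charset)


print("name*=" + ext_value("☃"))
print(parse_ext_value("utf-8''%E2%98%83"))
```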

Fortunately, RFC 5987 adds an encoding based on RFC 2231 to HTTP headers! (Kudos to @bobince for this find.) It says you can (and probably should) include both an RFC 2231-style value and a plain value:

Header field specifications need to define whether multiple instances of parameters with identical parmname components are allowed, and how they should be processed. This specification suggests that a parameter using the extended syntax takes precedence. This would allow producers to use both formats without breaking recipients that do not understand the extended syntax yet.

Example:

foo: bar; title="EURO exchange rates"; title*=utf-8''%e2%82%ac%20exchange%20rates

In their example, however, they "dumb down" the plain value for "legacy clients". This isn't really an option for a form-field name, so it seems like the best approach might be to include both name= and name*= versions, where the plain value is (as @bobince describes it) "just sending the bytes, quoted, in the same encoding as the form", like:

Content-Disposition: form-data; name="☃"; name*=utf-8''%E2%98%83
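
Here is one way such a header line could be produced, purely as a sketch. The function name form_data_disposition is invented for this example, the form encoding is assumed to be UTF-8, and nothing here guarantees that a given server actually honors the name*= parameter.

```python
from urllib.parse import quote


def form_data_disposition(name):
    """Hypothetical helper: header line carrying both name= and name*=."""
    # Escape backslashes and double quotes for the quoted plain value.
    plain = name.replace("\\", "\\\\").replace('"', '\\"')
    # Extended value in the RFC 2231 / RFC 5987 style.
    extended = "utf-8''" + quote(name, safe="", encoding="utf-8")
    header = 'Content-Disposition: form-data; name="%s"; name*=%s' % (
        plain,
        extended,
    )
    # The plain value goes out as raw bytes in the form's encoding
    # (assumed to be UTF-8 here).
    return header.encode("utf-8")


print(form_data_disposition("☃").decode("utf-8"))
```

Since the plain value contains raw non-ASCII bytes, the line is returned as bytes rather than text, so the caller doesn't re-encode it by accident.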

See also:

Finally, see http://larry.masinter.net/1307multipart-form-data.pdf (also https://www.w3.org/Bugs/Public/show_bug.cgi?id=16909#c8 ), wherein it is recommended to avoid the problem by sticking with ASCII form field names.
