HTTP query and URI encoding doubts


Question

Recently I was researching HTTP query strings while considering the options for a web service API, and the area seems very underspecified.

In fact, RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn't say anything about the format of the query string; it stops at defining which characters are allowed and how to encode the others. (I will return to this later.)

The only thing I found was the HTML specification of how forms are serialized into a query string (HTML 4.01, section 17.13.4 Form content types, application/x-www-form-urlencoded). The HTML 5 algorithm seems close enough (section 4.10.22.5 URL-encoded form data).
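To make the form serialization concrete, here is a minimal sketch using Python's standard urllib.parse module, which follows the application/x-www-form-urlencoded conventions. The field names and values are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical form fields; a repeated name ("tag") is allowed.
fields = [("name", "Jan Novák"), ("tag", "a&b"), ("tag", "c=d")]

# Spaces become "+", reserved characters and non-ASCII bytes (as UTF-8)
# are percent-encoded, and pairs are joined with "&".
encoded = urlencode(fields)
print(encoded)  # name=Jan+Nov%C3%A1k&tag=a%26b&tag=c%3Dd

# Decoding reverses the process and keeps the original order.
decoded = parse_qsl(encoded)
print(decoded == fields)  # True
```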

This might seem OK. After all, why would anyone want to dictate a query string format for everyone else? But are there any well-established standards other than HTML? Is anyone using a different format?

A side question here is how to deal with [] in form field names. PHP uses that suffix to ensure that multiple occurrences of a field are all present in the $_GET superglobal variable. (Otherwise only the last occurrence is present.)

But from RFC 3986 it seems that neither [ nor ] is allowed in a query string. Yet my experiments with various browsers suggest that no browser encodes those characters; they appear in the URI as-is.
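As a rough illustration of the two forms in question, the sketch below compares a query string with raw brackets against a strictly percent-encoded one. It uses Python's urllib.parse rather than PHP, so the handling of repeated names differs from $_GET: Python keeps every value in a list and treats the brackets as ordinary characters of the name. The field name id[] is only an example.

```python
from urllib.parse import parse_qs, quote

pairs = [("id[]", "1"), ("id[]", "2")]

# What the browsers in the question reportedly send: brackets left as-is.
raw = "&".join(f"{k}={v}" for k, v in pairs)

# Strict form: everything outside the unreserved set is percent-encoded.
strict = "&".join(f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in pairs)

print(raw)     # id[]=1&id[]=2
print(strict)  # id%5B%5D=1&id%5B%5D=2

# A tolerant parser decodes both to the same field name and values.
print(parse_qs(raw) == parse_qs(strict))  # True
```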

Is this real-life practice, or am I testing it incorrectly? I tested with PHP 5.3.17 on IIS 7, using Internet Explorer, Firefox and Chrome, and compared the contents of $_SERVER['QUERY_STRING'] and $_GET.

Another question is real-life support for semicolon separation.

The HTML 4.01 specification (B.2.2 Ampersands in URI attribute values) recommends that HTTP servers accept the semicolon (;) as a parameter separator, as an alternative to the ampersand (&).

Does any server support it? Is anyone using it? Is it worth bothering with (when deciding which query string formats a web service should accept)?
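For a sense of what accepting both separators would involve on the server side, here is a minimal hand-rolled parser. The function name parse_query is hypothetical, and this is only a sketch of the HTML 4.01 recommendation, not how any particular server implements it.

```python
import re
from urllib.parse import unquote_plus

def parse_query(query: str) -> list[tuple[str, str]]:
    """Split a query string on "&" or ";" and decode each name/value pair."""
    pairs = []
    for part in re.split(r"[&;]", query):
        if not part:
            continue  # skip empty segments such as "a=1&&b=2"
        name, _, value = part.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs

# A literal semicolon inside a value has to be sent as %3B.
print(parse_query("a=1;b=2&c=x%3By"))
# [('a', '1'), ('b', '2'), ('c', 'x;y')]
```

The cost of this design is that any client sending an unencoded semicolon inside a value will have that value split in two, which is one reason to decide on a single separator per service.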

Then how about non-ASCII character support?

The HTML 4.01 specification (B.2.1 Non-ASCII characters in URI attribute values) clearly restates what the URI RFCs said in the first place: non-ASCII characters are not allowed in URIs. Yet the specification takes existing practice (the use of illegal URIs) into account and advises converting such characters to UTF-8 and then applying the standard URI hex encoding (%HH) to each byte.
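The two-step procedure described there (encode to UTF-8, then escape each byte as %HH) can be sketched in a few lines of Python. The file name is just an example; urllib.parse.quote performs the same steps and uses UTF-8 by default.

```python
from urllib.parse import quote, unquote

name = "résumé.pdf"

# Step by step for a single character: UTF-8 bytes, then %HH per byte.
manual = "".join(f"%{byte:02X}" for byte in "é".encode("utf-8"))
print(manual)  # %C3%A9

# The library call applies the same rule to the whole string and leaves
# unreserved ASCII characters untouched.
encoded = quote(name, safe="")
print(encoded)  # r%C3%A9sum%C3%A9.pdf

# Decoding as UTF-8 restores the original name.
print(unquote(encoded) == name)  # True
```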

From my tests it seems that Chrome and Firefox, for example, do so. Internet Explorer did not; it sent those characters as they were. PHP coped with that only partially: $_SERVER['QUERY_STRING'] and $_GET contained those characters, but $_SERVER['REQUEST_URI'] contained a literal ? in their place.

Are there any standards or established practices for handling such cases?

A related question: how should authors publish (by URI) resources whose names contain non-ASCII (for example national) characters? Considering all the parties involved (the HTML code, the browser sending the request, the browser saving the file to disk, the server receiving and processing the request, and the server storing the file), it seems nearly impossible to make this work consistently. At least I have never managed to.

For web pages I am already used to this and always replace national characters with the corresponding Latin base characters. But for external files (PDFs, images, ...) it somehow "feels wrong" to "downgrade" the names, especially if users are expected to save those files to disk. How should I deal with this issue?

Recommended answer


> In fact RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax) doesn't say anything about the format of the query string

Yes, it does, in Section 3.4:

query       = *( pchar / "/" / "?" )

pchar is defined in Section 3.3:

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
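As a sketch of what this grammar permits, the two rules can be transcribed into a regular expression. The unreserved and sub-delims sets below are copied from RFC 3986 Section 2; the function name is_valid_query is made up for this example.

```python
import re

# query = *( pchar / "/" / "?" )
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
QUERY_RE = re.compile(
    r"(?:"
    r"[A-Za-z0-9\-._~"     # unreserved
    r"!$&'()*+,;="         # sub-delims
    r":@/?]"               # extra characters allowed in a query
    r"|%[0-9A-Fa-f]{2}"    # pct-encoded
    r")*"
)

def is_valid_query(query: str) -> bool:
    """Return True if the string matches the RFC 3986 query production."""
    return QUERY_RE.fullmatch(query) is not None

print(is_valid_query("a=1&b=2"))     # True
print(is_valid_query("id[]=1"))      # False: "[" and "]" are not in pchar
print(is_valid_query("id%5B%5D=1"))  # True once the brackets are encoded
```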




> and ends on defining which characters are allowed and how to encode other characters.

Exactly. That is the definition of the query string format.


> But from RFC 3986 it seems that neither [ nor ] are allowed in query string.

Officially, that is correct. But not all browsers encode them, and that is broken behavior on their part. All official specs I have seen (and RFC 3986 is not the only one in play) say those characters must be percent-encoded.


> Then how about non-ASCII characters support?

Non-ASCII characters are not allowed in URIs. They must be charset-encoded and then percent-encoded. The charset actually used is server-specific; no spec gives a URI a way to declare the charset it uses. Various specs recommend UTF-8 but do not require it, and some servers do in fact use other charsets.
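Because the URI itself does not say which charset was used, a receiver can only guess. One common heuristic, shown here purely as an assumption and not as something any spec mandates, is to try strict UTF-8 first and fall back to a single-byte charset. The helper name decode_component and the ISO-8859-1 fallback are both hypothetical choices.

```python
from urllib.parse import unquote

def decode_component(raw: str, fallback: str = "iso-8859-1") -> str:
    """Percent-decode a URI component, guessing the charset."""
    try:
        # Strict mode raises instead of silently inserting U+FFFD.
        return unquote(raw, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the sender used the fallback charset.
        return unquote(raw, encoding=fallback)

print(decode_component("caf%C3%A9"))  # café (valid UTF-8)
print(decode_component("caf%E9"))     # café (decoded as ISO-8859-1)
```

The guess can be wrong for byte sequences that happen to be valid in both charsets, which is the practical consequence of the charset being server-specific.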

The IRI spec (RFC 3987), which extends the URI syntax rather than replacing it, supports the full Unicode character set, but IRIs are still relatively new and many servers do not support them yet. However, the RFC does define algorithms for converting IRIs to URIs and vice versa.
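A heavily simplified sketch of the IRI-to-URI direction is shown below. It is not a full implementation of the RFC 3987 mapping: it assumes an absolute IRI with a host and no userinfo, skips Unicode normalization, and relies on Python's built-in idna codec (IDNA 2003) for the host name. The function name iri_to_uri and the example address are made up.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Reserved characters are kept, and "%" is kept so that escapes already
# present in the input are not encoded a second time.
SAFE = "/?:@!$&'()*+,;=%"

def iri_to_uri(iri: str) -> str:
    parts = urlsplit(iri)
    # Host labels are converted to their ASCII (Punycode) form.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Everything else: UTF-8 encode, then percent-encode each byte.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))

print(iri_to_uri("http://bücher.example/straße?q=é#ü"))
# http://xn--bcher-kva.example/stra%C3%9Fe?q=%C3%A9#%C3%BC
```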

When in doubt, percent-encode everything you are not sure about. Servers are required to decode percent-encoded characters when present, and then process the decoded data as needed.
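In Python, for instance, "encode everything you are not sure about" amounts to calling quote with an empty safe set, which leaves only the unreserved characters (letters, digits, "-", ".", "_", "~") untouched. The sample value is arbitrary.

```python
from urllib.parse import quote, unquote

value = "a/b c?.pdf"

# safe="" means no reserved character is exempt from encoding.
encoded = quote(value, safe="")
print(encoded)  # a%2Fb%20c%3F.pdf

# The receiving side decodes before using the data.
print(unquote(encoded) == value)  # True
```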
