What characters must be escaped in an HTTP query string?


Problem description


This question concerns the characters in the query string portion of the URL, which appear after the ? mark character.

Per Wikipedia, certain characters are left as is and others are encoded (usually with a % escape sequence).

I've been trying to track this down to actual specifications, so that I understand the justification behind every bullet point in that Wikipedia page.

Contradiction Example 1:

The HTML specification says to encode space as + and defers the rest to RFC1738. However, this RFC says that ~ is unsafe and furthermore that "[a]ll unsafe characters must always be encoded within the URL". This seems to contradict Wikipedia.

In practice, IE8 encodes ~ in the query strings it generates, while FF3 leaves it as is.

Contradiction Example 2:

Wikipedia states that all characters that it does not mention must be encoded. ! is not mentioned in Wikipedia. But RFC1738 states that ! is a "special" character and "may be used unencoded". This seems to contradict Wikipedia which says that it must be encoded.

In practice, IE8 encodes ! in the query strings it generates, while FF3 leaves it as is.

I understand that the moral of this is probably going to be to encode those characters that are in doubt between Wikipedia and the specifications. Perhaps even going as far as encoding everything that is not [A-Za-z0-9]. I would just like to know the actual standards on this.

Conclusions

The algorithm described on Wikipedia encodes precisely those characters which are not RFC3986 unreserved characters. That is, it encodes all characters other than alphanumerics and -._~. As a special case, space is encoded as + rather than as the %20 that plain RFC3986 percent-encoding would produce.

Some applications use an older RFC. For comparison, the RFC2396 unreserved characters are alphanumerics and !'()*-._~.

The HTML5 working draft algorithm, by contrast, encodes all characters other than alphanumerics and *-._. The special case encoding for space remains +. The notable differences from RFC3986 are that * is not encoded and ~ is encoded. (Technically, this handling of * is compatible with RFC3986 even though * is a reserved character, because it belongs to sub-delims, which are allowed in the query production.)
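
To make the three rule sets above easier to compare, here is a minimal sketch in Python. The names `SAFE_SETS` and `encode_component` are made up for this illustration and do not come from any specification or library; the sketch assumes UTF-8 as the byte encoding and writes space as + in all three modes.

```python
import string

ALNUM = string.ascii_letters + string.digits

# Characters left unencoded under each rule set described above.
# The dictionary keys are labels invented for this sketch.
SAFE_SETS = {
    "rfc3986": ALNUM + "-._~",
    "rfc2396": ALNUM + "!'()*-._~",
    "html5_draft": ALNUM + "*-._",
}


def encode_component(text: str, rules: str = "rfc3986") -> str:
    """Percent-encode text, writing space as '+' (form style)."""
    safe = set(SAFE_SETS[rules])
    pieces = []
    # Work on UTF-8 bytes so that non-ASCII input becomes %XX triplets.
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in safe:
            pieces.append(ch)
        elif ch == " ":
            pieces.append("+")
        else:
            pieces.append(f"%{byte:02X}")
    return "".join(pieces)


sample = "a b~c!*"
for name in SAFE_SETS:
    print(name, encode_component(sample, name))
# rfc3986 a+b~c%21%2A
# rfc2396 a+b~c!*
# html5_draft a+b%7Ec%21*
```

The same input comes out three different ways, which matches the kind of browser disagreement over ~ and ! described in the question.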

Solution

The answer lies in the RFC 3986 document, specifically Section 3.4.

The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.

...

The characters slash ("/") and question mark ("?") may represent data within the query component.

Technically, RFC 3986, Section 3.4 defines the query component as:

query       = *( pchar / "/" / "?" )

This syntax means that a query can include every character allowed by pchar, as well as / and ?. pchar is the production the same RFC uses for characters in a path segment. Helpfully, Appendix A of RFC 3986 lists the relevant ABNF definitions, most notably:

query         = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Thus, in addition to all alphanumerics and percent encoded characters, a query can legally include the following unencoded characters:

/ ? : @ - . _ ~ ! $ & ' ( ) * + , ; =
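
As a rough way to check a string against this character list, the query production can be transcribed into a regular expression. This is only a sketch: `QUERY_RE` and `is_valid_query` are hypothetical names, and the pattern checks syntax only, not whether the query means what the sender intended.

```python
import re

# Transcription of: query = *( pchar / "/" / "?" )
# The character class covers unreserved, sub-delims, ":", "@", "/" and "?";
# the alternative after "|" is a pct-encoded triplet.
QUERY_RE = re.compile(
    r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@/?]|%[0-9A-Fa-f]{2})*"
)


def is_valid_query(query: str) -> bool:
    """Return True if query (without the leading '?') matches the grammar."""
    return QUERY_RE.fullmatch(query) is not None


print(is_valid_query("a=1&b=x/y?z"))  # True
print(is_valid_query("a b"))          # False: raw space is not allowed
print(is_valid_query("q=100%"))       # False: '%' must start a %XX triplet
```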

Of course, you may want to keep in mind that '=' and '&' usually have special significance within a query.
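
To illustrate why '=' and '&' still deserve care even though the grammar allows them, here is a small sketch using Python's standard `urllib.parse` module. The parameter names and values are invented for the example.

```python
from urllib.parse import parse_qs, urlencode

# A value that itself contains '&' and '='.
params = {"q": "a&b=c", "lang": "en"}

# urlencode percent-encodes the delimiters inside the value.
query = urlencode(params)
print(query)            # q=a%26b%3Dc&lang=en
print(parse_qs(query))  # {'q': ['a&b=c'], 'lang': ['en']}

# Pasting the raw value in is still a legal RFC 3986 query,
# but it parses as three separate parameters.
naive = "q=" + params["q"] + "&lang=" + params["lang"]
print(parse_qs(naive))  # {'q': ['a'], 'b': ['c'], 'lang': ['en']}
```

Both strings are syntactically valid queries; only the encoded one round-trips to the original value.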
