Discrepancies of Percent Encoding for URLs

Question

After viewing this previous SO question regarding percent encoding, I'm curious as to which styles of encoding are correct. The Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-form-urlencoded content type.

This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What differences are preferred for path segments vs. query strings? Details and references for this specification would be greatly appreciated.

Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet for a character becomes a %XX string. Correct me if I am wrong here (for instance latin-1 instead of utf-8), but I am more interested in the differences between the encodings of different parts of a URL.

Recommended answer


This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded.

Not only does it depend on the particular URL component, but it also depends on the circumstances in which that component is populated with data.

The use of '+' for encoding space characters is specific to the application/x-www-form-urlencoded format, which applies to webform data that is being submitted in an HTTP request. It does not apply to a URL itself.

The application/x-www-form-urlencoded format is formally defined by W3C in the HTML specifications. Here is the definition from HTML 4.01:

Section 17.13.3 Processing form data, Step four: Submit the encoded form data set

This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:

• If the method is "get" and the action is an HTTP URI, the user agent takes the value of action, appends a `?' to it, then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.

• If the method is "post" and the action is an HTTP URI, the user agent conducts an HTTP "post" transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.

Section 17.13.4 Form content types, application/x-www-form-urlencoded


This is the default content type. Forms submitted with this content type must be encoded as follows:

1. Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').

2. The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
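The two rules above can be sketched with Python's standard library. This is only an illustration, not part of the HTML specification: the field names and values are made up, and urllib.parse.urlencode is assumed here as a convenient stand-in for what a browser does on form submission.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form fields, chosen only for illustration.
fields = [("field", "hello world"), ("note", "a&b=c")]

# urlencode() follows the form-urlencoded conventions: spaces become '+',
# reserved characters become %HH, names and values are joined with '=',
# and pairs are joined with '&' in the order given.
encoded = urlencode(fields)
print(encoded)

# parse_qs() reverses the encoding, turning '+' back into a space.
decoded = parse_qs(encoded)
print(decoded)
```

Note that the '&' and '=' inside the second value are escaped, so they cannot be confused with the separators added in step 2.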

The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are far more refined and detailed, but for purposes of this discussion, the gist is roughly the same.

So, in the situation where the webform data is submitted via an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.

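According to RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.

'+' is a reserved character: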

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

The query component explicitly allows unencoded '+' characters, as it allows characters from sub-delims:

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

pct-encoded = "%" HEXDIG HEXDIG

pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"

query       = *( pchar / "/" / "?" )
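As a rough check of the grammar above, the rules can be transcribed into a regular expression. This is a sketch under assumptions: is_valid_query is a hypothetical helper name, and the pattern covers only the query rule, not a complete URI.

```python
import re

# Character sets transcribed from the RFC 3986 ABNF.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"

# query = *( pchar / "/" / "?" )
QUERY_RE = re.compile(rf"(?:{PCHAR}|[/?])*")


def is_valid_query(query: str) -> bool:
    """Return True if the string matches the RFC 3986 query rule."""
    return QUERY_RE.fullmatch(query) is not None


print(is_valid_query("field=hello+world"))  # '+' is a sub-delim, so it is allowed
print(is_valid_query("hello%20world"))      # pct-encoded space is allowed
print(is_valid_query("hello world"))        # a raw space is not allowed
```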

So, in the context of a webform submission, spaces are encoded using '+' prior to then being put as-is into the query component. This is allowed by the URL syntax, since the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.

So, for example: http://server/script?field=hello+world

However, outside of a webform submission, putting a space character directly into the query component requires the use of pct-encoded, since ' ' is not included in either unreserved or sub-delims, and is not explicitly allowed by the query definition.

So, for example: http://server/script?hello%20world
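The difference between the two styles can be seen with Python's urllib.parse, used here only as one convenient illustration. quote_plus and unquote_plus follow the form-urlencoded convention, while quote and unquote follow plain percent-encoding.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "hello world"

# Form-urlencoded style: a space becomes '+'.
form_style = quote_plus(text)
print(form_style)       # hello+world

# Generic URI style: a space becomes %20.
uri_style = quote(text)
print(uri_style)        # hello%20world

# Decoding depends on the convention too: only the form-style
# decoder treats '+' as an encoded space.
plus_as_literal = unquote("hello+world")
plus_as_space = unquote_plus("hello+world")
print(plus_as_literal)  # hello+world
print(plus_as_space)    # hello world
```

A receiver therefore has to know which convention the sender used before it can interpret a '+' in a query string.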

The path component is defined similarly, because its segments are also built from pchar:

  path          = path-abempty    ; begins with "/" or is empty
                / path-absolute   ; begins with "/" but not "//"
                / path-noscheme   ; begins with a non-colon segment
                / path-rootless   ; begins with a segment
                / path-empty      ; zero characters

  path-abempty  = *( "/" segment )
  path-absolute = "/" [ segment-nz *( "/" segment ) ]
  path-noscheme = segment-nz-nc *( "/" segment )
  path-rootless = segment-nz *( "/" segment )
  path-empty    = 0<pchar>
  segment       = *pchar
  segment-nz    = 1*pchar
  segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                ; non-zero-length segment without any colon ":"

So, although path does allow for unencoded sub-delims characters, a '+' character gets treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and segment-nz-nc.
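A small sketch of the path case, again using Python's urllib.parse. The file name and the URL are hypothetical. Passing safe="+" to quote is a choice made here to leave the sub-delim '+' unencoded, which the grammar permits.

```python
from urllib.parse import parse_qs, quote, unquote, urlsplit

# Hypothetical file name containing both a space and a literal '+'.
segment = "my file+v2.txt"

# In a path segment the space must become %20; '+' may stay as-is.
encoded_segment = quote(segment, safe="+")
print(encoded_segment)

# Hypothetical URL combining that path with a form-style query.
url = "http://server/docs/" + encoded_segment + "?q=hello+world"
parts = urlsplit(url)

# Path: '+' is an ordinary character, so plain unquote() is the right decoder.
decoded_path = unquote(parts.path)
print(decoded_path)

# Query produced by a webform: '+' means space, so use a form-aware parser.
decoded_query = parse_qs(parts.query)
print(decoded_query)
```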

Now, regarding the charset used to encode characters:

For a webform submission, that charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than HTML4) used to prepare the webform data prior to inserting it into the URL. In a nutshell, the HTML can specify an accept-charset attribute or hidden _charset_ field directly in the <form> itself, otherwise the charset is typically the charset used by the parent HTML.

However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ASCII characters in a URL component (the IRI syntax, on the other hand, requires UTF-8, especially when converting an IRI into a URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not), otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some servers/schemes that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding), but those are not commonly used.
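To see why the charset matters, here is a minimal sketch using Python's urllib.parse. The character "é" is an arbitrary example. The same character yields different %XX sequences under UTF-8 and Latin-1, and decoding with the wrong charset loses it.

```python
from urllib.parse import quote, unquote

# The same character percent-encoded under two different charsets.
utf8_form = quote("é", encoding="utf-8")      # two octets
latin1_form = quote("é", encoding="latin-1")  # one octet
print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9

# Decoding with the matching charset round-trips correctly.
round_trip = unquote(latin1_form, encoding="latin-1")
print(round_trip)   # é

# Decoding Latin-1 octets as UTF-8 (unquote's default) fails; by default
# the invalid byte is replaced with U+FFFD rather than raising an error.
mismatched = unquote(latin1_form)
print(mismatched == "\ufffd")  # True
```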
