JSON Specification and usage of BOM/charset-encoding

Question

I've been reading up on the RFC-4627 specification, and I've come to the following interpretation:

When advertising a payload as application/json mime-type,

  1. there MUST be no BOMs at the beginning of properly encoded JSON streams (based on section "3. Encoding"), and
  2. no media parameters are supported, thus a mime-type header of application/json; charset=utf-8 does not conform to RFC-4627 (based on section "6. IANA Considerations").

Are these correct deductions? Will I run into problems when implementing web-services or web-clients which adhere to this interpretation? Should I file bugs against web browsers which violate the two properties above?

Recommended answer

You are right:

  1. The BOM character is illegal in JSON (and not needed)
  2. The MIME charset is illegal in JSON (and not needed as well)

RFC 7159, Section 8.1:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text.

This is put as clearly as it can be. This is the only "MUST NOT" in the entire RFC.

RFC 7159, Section 11:

The MIME media type for JSON text is application/json.
Type name: application
Subtype name: json
Required parameters: n/a
Optional parameters: n/a
[...]
Note: No "charset" parameter is defined for this registration.

JSON encoding

The only valid encodings of JSON are UTF-8, UTF-16 or UTF-32. Since the first character (or the first two, if there is more than one character) will always have a Unicode value lower than 128 (no valid JSON text can have higher values in its first two characters), it is always possible to tell which of the valid encodings and which endianness was used just by looking at the byte stream.

The JSON RFC says that the first two characters will always be below 128 and you should check the first 4 bytes.

I would put it differently: since the one-character text 1 is also valid JSON, there is no guarantee that you have two characters at all - let alone 4 bytes.

My recommendation for determining the JSON encoding would be slightly different:

The quick method:

  1. if you have 1 byte and it's not NUL - it's UTF-8
    (actually the only valid character here would be an ASCII digit)
  2. if you have 2 bytes and none of them are NUL - it's UTF-8
    (those must be ASCII digits with no leading '0', {}, [] or "")
  3. if you have 2 bytes and only the first is NUL - it's UTF-16BE
    (it must be an ASCII digit encoded as UTF-16, big endian)
  4. if you have 2 bytes and only the second is NUL - it's UTF-16LE
    (it must be an ASCII digit encoded as UTF-16, little endian)
  5. if you have 3 bytes and they are not NUL - it's UTF-8
    (again, ASCII digits with no leading '0's, "x", [1] etc.)
  6. if you have 4 bytes or more, then the RFC method works:
    • 00 00 00 xx - it's UTF-32BE
    • 00 xx 00 xx - it's UTF-16BE
    • xx 00 00 00 - it's UTF-32LE
    • xx 00 xx 00 - it's UTF-16LE
    • xx xx xx xx - it's UTF-8

But this only works if the input is indeed a valid string in one of those encodings, which it may not be. Moreover, even a valid string in one of the 5 valid encodings may still not be valid JSON.
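The quick method above can be sketched in Python roughly as follows. This is only an illustration of the listed rules, not a reference implementation: the function name `guess_json_encoding` is hypothetical, and the sketch assumes the input carries no BOM and only inspects the NUL pattern, so it says nothing about whether the bytes are valid in the guessed encoding or valid JSON.

```python
def guess_json_encoding(data: bytes) -> str:
    """Guess the encoding of a BOM-less JSON byte stream from its NUL pattern.

    Hypothetical helper following the quick method described above.
    Returns a Python codec name; raises ValueError if no rule matches.
    """
    n = len(data)
    if n == 0:
        raise ValueError("empty input is not a JSON text")

    if n >= 4:
        # True where the byte is NUL, for the first four bytes only.
        nul = tuple(b == 0 for b in data[:4])
        patterns = {
            (True, True, True, False): "utf-32-be",    # 00 00 00 xx
            (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
            (False, True, True, True): "utf-32-le",    # xx 00 00 00
            (False, True, False, True): "utf-16-le",   # xx 00 xx 00
            (False, False, False, False): "utf-8",     # xx xx xx xx
        }
        if nul not in patterns:
            raise ValueError("NUL pattern matches no valid JSON encoding")
        return patterns[nul]

    # Two bytes with exactly one NUL: a single character in UTF-16.
    if n == 2 and data[0] == 0 and data[1] != 0:
        return "utf-16-be"
    if n == 2 and data[0] != 0 and data[1] == 0:
        return "utf-16-le"

    # One to three bytes without any NUL: UTF-8.
    if 0 not in data:
        return "utf-8"

    raise ValueError("NUL pattern matches no valid JSON encoding")
```

For example, `guess_json_encoding("[1]".encode("utf-16-le"))` falls into the `xx 00 xx 00` case. The result is only a guess and still has to be confirmed by actually decoding and parsing the input.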

My recommendation would be to have a slightly more rigid verification than the one included in the RFC to verify that you have:

  1. a valid encoding of either UTF-8, UTF-16 or UTF-32 (LE or BE)
  2. valid JSON

Looking only for NUL bytes is not enough.
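A stricter check along these lines might look like the sketch below. It assumes the encoding has already been guessed by some other means and is passed in as a Python codec name; `parse_json_strict` and `_reject_constant` are hypothetical names. Note that Python's `json` module accepts `NaN` and `Infinity` by default, so the sketch rejects them explicitly through the `parse_constant` hook.

```python
import json

# The five encodings discussed above, as Python codec names.
ALLOWED_ENCODINGS = {"utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"}


def _reject_constant(name: str):
    # json.loads calls this hook for NaN, Infinity and -Infinity,
    # none of which are valid JSON.
    raise ValueError(f"{name} is not valid JSON")


def parse_json_strict(data: bytes, encoding: str):
    """Decode and parse, rejecting bad encodings, BOMs and invalid JSON."""
    if encoding not in ALLOWED_ENCODINGS:
        raise ValueError(f"{encoding!r} is not a valid JSON encoding")
    try:
        # Step 1: the bytes must be well-formed in the claimed encoding.
        text = data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid {encoding}") from exc
    if text.startswith("\ufeff"):
        # These codecs keep a leading BOM as U+FEFF instead of stripping it.
        raise ValueError("JSON text must not start with a byte order mark")
    # Step 2: the decoded text must be valid JSON
    # (json.JSONDecodeError is a subclass of ValueError).
    return json.loads(text, parse_constant=_reject_constant)
```

Every failure surfaces as a `ValueError`, so a caller can reject the input in one place instead of distinguishing encoding errors from syntax errors.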

That having been said, at no point do you need any BOM characters to determine the encoding, nor do you need the MIME charset - both are unnecessary and not valid in JSON.

You only have to use the binary content-transfer-encoding when using UTF-16 and UTF-32, because those may contain NUL bytes. UTF-8 doesn't have that problem, and the 8bit content-transfer-encoding is fine because UTF-8 doesn't contain NUL in the string. It still contains bytes >= 128, though, so 7-bit transfer will not work. UTF-7 would work for such a transfer, but the result wouldn't be valid JSON, as UTF-7 is not one of the valid JSON encodings.

See also this answer for more details.

Are these correct deductions?

Yes.

Will I run into problems when implementing web-services or web-clients which adhere to this interpretation?

Possibly, if you interact with incorrect implementations. Your implementation MAY ignore the BOM for the sake of interoperability with incorrect implementations - see RFC 7159, Section 8.1:

In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
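If you do choose to tolerate a BOM on input, one possible approach is sketched below. The function name `strip_bom` is hypothetical and the sketch only uses the BOM constants from Python's standard `codecs` module. The UTF-32 marks are checked before the UTF-16 ones because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).

```python
import codecs
from typing import Optional, Tuple

# Order matters: the UTF-32 BOMs must be tested before the UTF-16 BOMs.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def strip_bom(data: bytes) -> Tuple[Optional[str], bytes]:
    """Return (encoding implied by the BOM or None, data without the BOM)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data
```

The encoding implied by the BOM is returned rather than silently trusted, so the caller can compare it against what the payload itself looks like and decide whether to reject a mismatch.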

Also, ignoring the MIME charset is the expected behavior of compliant JSON implementations - see RFC 7159, Section 11:

Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.
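As a small illustration of a recipient that ignores the parameter, the sketch below compares only the media type of a Content-Type header value and discards anything after the first semicolon. The function name `is_json_media_type` is hypothetical, and the parsing is deliberately simplistic rather than a full header parser.

```python
def is_json_media_type(content_type: str) -> bool:
    """True if the header value's media type is application/json.

    Parameters such as charset are dropped, not interpreted.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json"
```

With this check, `application/json` and `application/json; charset=utf-8` are treated identically, and the encoding is determined from the payload alone.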

Security considerations

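I am not personally convinced that silently accepting incorrect JSON streams is always desired. If you decide to accept input with a BOM and/or MIME charset, then you will have to answer these questions:

  • What to do if there is a mismatch between the MIME charset and the actual encoding?
  • What to do if there is a mismatch between the BOM and the MIME charset?
  • What to do if there is a mismatch between the BOM and the actual encoding?
  • What to do when all of them differ?
  • What to do with encodings other than UTF-8/16/32?
  • Are you sure that all security checks will work as expected?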

Having the encoding defined in three independent places - in the JSON string itself, in the BOM and in the MIME charset - makes the question inevitable: what to do if they disagree? Unless you reject such input, there is no one obvious answer.

For example, if you have code that verifies a JSON string to see if it's safe to eval it in JavaScript, it might be misled by the MIME charset or the BOM, treat the input as a different encoding than it actually is, and fail to detect strings that it would detect if it used the correct encoding. (A similar problem with HTML has led to XSS attacks in the past.)

You have to be prepared for all of those possibilities whenever you decide to accept incorrect JSON strings with multiple and possibly conflicting encoding indicators. It's not to say that you should never do that because you may need to consume input generated by incorrect implementations. I'm just saying that you need to thoroughly consider the implications.

Should I file bugs against web browsers which violate the two properties above?

Certainly - if they call it JSON and the implementation doesn't conform to the JSON RFC then it is a bug and should be reported as such.

Have you found any specific implementations that don't conform to the JSON specification and yet advertise that they do?
