Encoding JSON in UTF-16 or UTF-32

Question

The JSON RFC (RFC 4627), section 2.5, says in part:


To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
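The twelve-character escape in the quoted passage follows from the standard UTF-16 surrogate arithmetic. The sketch below is only an illustration in Python; the helper names `surrogate_pair` and `json_escape` are hypothetical and do not come from the RFC.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def json_escape(code_point: int) -> str:
    """Build the twelve-character \\uXXXX\\uXXXX escape for a non-BMP code point."""
    high, low = surrogate_pair(code_point)
    return f"\\u{high:04X}\\u{low:04X}"


print(json_escape(0x1D11E))  # \uD834\uDD1E
```

For U+1D11E the offset is 0xD11E, which splits into 0x34 and 0x11E, giving the surrogates D834 and DD1E.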

Assume I have a valid reason to encode JSON as UTF-16BE (which is allowed). When doing so, is it still necessary to escape characters that are not in the Basic Multilingual Plane? E.g., instead of this:

00 5C 00 75 00 44 00 38 00 33 00 34 00 5C 00 75 00 44 00 44 00 31 00 45
  \     u     D     8     3     4     \     u     D     D     1     E

which is the 24-byte UTF-16BE byte sequence for \uD834\uDD1E, is it legal to do this:

D8 34 DD 1E

i.e., use the 4-byte UTF-16BE values directly?

Similarly, if I were to encode the same JSON string as UTF-32BE, could I simply use the code-point value directly:

00 01 D1 1E
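The byte sequences above can be reproduced with Python's built-in codecs. This is a small check added for illustration, not part of the original question. It encodes only the string contents, without the surrounding quotation marks, and `bytes.hex` with a separator assumes Python 3.8 or later.

```python
g_clef = "\U0001D11E"                 # the G clef character itself
escaped = "\\uD834\\uDD1E"            # the twelve-character escape as literal text

escaped_utf16be = escaped.encode("utf-16-be")
direct_utf16be = g_clef.encode("utf-16-be")
direct_utf32be = g_clef.encode("utf-32-be")

print(escaped_utf16be.hex(" ").upper())
# 00 5C 00 75 00 44 00 38 00 33 00 34 00 5C 00 75 00 44 00 44 00 31 00 45
print(len(escaped_utf16be))           # 24
print(direct_utf16be.hex(" ").upper())  # D8 34 DD 1E
print(direct_utf32be.hex(" ").upper())  # 00 01 D1 1E
```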

Recommended answer

As far as I can tell, yes, you can write the UTF-16 values directly. Support: the RFC paragraph you quoted explains how to escape arbitrary Unicode if you have decided to escape it. However, earlier in that same section, the RFC says


All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence...

(Emphasis added.)

To me, this says that only ", \ and control characters must be escaped, and that any other Unicode characters may be placed as-is directly into the JSON text (in whatever UTF form you are using). It also says to me that even if you're encoding as UTF-8, you don't need to use the \uXXXX form for any Unicode character other than ", \, and control characters.
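One way to test this reading against a real parser is a round trip through Python's standard `json` module. Python is used here only as an example; it shows what one parser accepts, and other parsers may behave differently. With `ensure_ascii=False` the serializer writes the character directly, and with the default it writes the `\uXXXX` escape.

```python
import json

g_clef = "\U0001D11E"

# Direct form: the character is written as-is inside the quotation marks.
direct_text = json.dumps(g_clef, ensure_ascii=False)
# Escaped form: Python's default output uses lowercase \ud834\udd1e.
escaped_text = json.dumps(g_clef)

direct_bytes = direct_text.encode("utf-16-be")
escaped_bytes = escaped_text.encode("utf-16-be")

print(direct_bytes.hex(" ").upper())  # 00 22 D8 34 DD 1E 00 22
print(len(escaped_bytes))             # 28

# Both byte sequences decode to the same one-character string.
decoded_direct = json.loads(direct_bytes.decode("utf-16-be"))
decoded_escaped = json.loads(escaped_bytes.decode("utf-16-be"))
print(decoded_direct == decoded_escaped == g_clef)  # True
```

Here the direct form takes 8 bytes including the two quotation marks, against 28 bytes for the escaped form.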

(As an aside, this does make me wonder whether the \uXXXX form is actually useful for anything other than control characters. As the other poster said, it probably comes down to what your JSON parser actually supports.)
