Why does Android Shift-JIS encoding of Yen (U+00A5) symbol produce -4,-4?


Question

Running the following code seems to generate the wrong values:

byte[] data = "\u00a5".getBytes("Shift_JIS");

It produces [ -4, -4 ], but I expect [ 0x5c ].

I've tried various alternative names, "Shift-JIS", "shift_jis", "cp932" and all produce the same result.

When I feed the resulting data into the Shift-JIS decoder, I get an exception: java.nio.charset.UnmappableCharacterException: Length: 2

That is, with the decoder configured as follows:

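import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

Charset charset = Charset.forName("Shift_JIS");
CharsetDecoder decoder = charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);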

But given the output of the encoder looks wrong, my guess is that the decoder is irrelevant. My point is that regardless of the actual bytes, the encoder generates data that it can't decode.

The full width Yen (U+FFE5) encodes to [ -127 (0x81), -113 (0x8F) ], and decodes correctly.

Strangely, if I try to decode [ 92 (0x5C) ], which is what I think the Shift-JIS encoding of the single width Yen is, the Android/Java decoder produces a backslash, leaving the character as 92.

If the encoder didn't support a given character, I would expect a replacement character such as '?'. But -4 (0xFC) doesn't even seem to be valid Shift-JIS. It's not even the Unicode replacement character U+FFFD. Using the following line I can see that the encoder seems to be configured to use [-4, -4]:

Charset.forName("Shift_JIS").newEncoder().replacement()

  • So why isn't the single width Yen mapped in Shift-JIS?
  • Is [-4, -4] a sensible encoder replacement?
  • Why doesn't the decoder support 0x5C mapping to Yen (U+00A5)?
  • If 0x5C is not the correct encoding, what is?

Answer

A partial answer: back when Microsoft created its east-Asian code pages for Windows, like the Japanese code page 932 and Korean 949, they made the byte 0x5C render as a currency symbol (either a Yen sign or Won sign respectively) while still syntactically acting as a backslash character in file paths (so that a file path on a Japanese system might look like

C:¥Documents¥something.doc

). Thus the byte was in a sense a Yen sign, but also in a sense a backslash; the same byte was even rendered as a different one of these symbols depending upon the font when on a Japanese system, according to http://archives.miloush.net/michkap/archive/2005/09/17/469941.html.

The lack of a consistent meaning of the symbol within the encoding means that while a Shift-JIS encoder can sensibly map both \ and ¥ to the byte 0x5C, a decoder trying to map a Shift-JIS-encoded string to a sequence of Unicode code points has no way of knowing whether to convert the byte 0x5C to a backslash or to a yen sign. Japanese users used to make that choice via their font selection (if they were able to make it at all).
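If the receiving side follows the legacy convention and displays 0x5C as a yen sign, one possible workaround is to fold U+00A5 into a backslash before encoding. This is only a sketch under that assumption: `YenToBackslash` is a made-up helper name, not a library class, and the folding loses the distinction between the two characters.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
class YenToBackslash {
    // Assumption: the consumer of these bytes renders 0x5C as a yen sign.
    static byte[] encode(String text, Charset charset) {
        // Fold YEN SIGN (U+00A5) into REVERSE SOLIDUS (U+005C) so the
        // encoder never sees a character it might treat as unmappable.
        String folded = text.replace('\u00a5', '\\');
        return folded.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("\u00a5100", Charset.forName("Shift_JIS"));
        for (byte b : bytes) {
            System.out.printf("0x%02X ", b);
        }
        System.out.println();
    }
}
```

Passing the charset as a parameter keeps the helper usable with "cp932" or any other alias the platform accepts.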

In the face of this unfixable ambiguity, all decoders seem to choose to decode 0x5C to a backslash. (At least, Python does this, and the WHATWG has a spec that dictates it.)
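To see what a particular runtime's decoder does with 0x5C, a small probe along these lines might help. It is a sketch, not a statement about any specific platform: `DecodeProbe` and `decodeStrict` are illustrative names, and the decoder is set to REPORT so that bad input raises an exception instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical probe class for illustration only.
class DecodeProbe {
    // Decodes strictly: malformed or unmappable input throws.
    static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        Charset sjis = Charset.forName("Shift_JIS");
        // The single byte from the question, then the full-width yen bytes.
        byte[][] samples = { {0x5C}, {(byte) 0x81, (byte) 0x8F} };
        for (byte[] sample : samples) {
            String decoded = decodeStrict(sample, sjis);
            System.out.printf("U+%04X%n", (int) decoded.charAt(0));
        }
    }
}
```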

As for the details of what Java/Android in particular are doing when asked to encode a Yen sign in shift_jis, I'm afraid I don't know.
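One way to investigate on a given runtime is to ask the encoder directly whether it can encode the character, look at its configured replacement bytes, and encode with REPORT so that an unmappable character raises an exception rather than being swapped for the replacement. The sketch below assumes nothing about what Android will actually print; `EncoderProbe` and its methods are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

// Hypothetical probe class for illustration only.
class EncoderProbe {
    // Summarizes whether c is encodable and which replacement bytes are configured.
    static String describe(Charset charset, char c) {
        CharsetEncoder encoder = charset.newEncoder();
        return "canEncode=" + encoder.canEncode(c)
                + " replacement=" + Arrays.toString(encoder.replacement());
    }

    // Encodes strictly: an unmappable character throws instead of being replaced.
    static byte[] encodeStrict(String text, Charset charset)
            throws CharacterCodingException {
        ByteBuffer buffer = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println(describe(sjis, '\u00a5'));
        try {
            System.out.println(Arrays.toString(encodeStrict("\u00a5", sjis)));
        } catch (CharacterCodingException e) {
            System.out.println("not encodable: " + e);
        }
    }
}
```

`String.getBytes` always substitutes the charset's replacement bytes for unmappable characters, so going through `CharsetEncoder` is what makes the failure visible.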
