String encoding conversion UTF-8 to SHIFT-JIS
Question
Variables used:
- JavaSE-6
- No framework
Given the string input ピーター・ジョーズ, which is encoded in UTF-8, I am having problems converting the string to Shift-JIS without writing the data to a file.
- Input (UTF-8 encoding):
ピーター・ジョーンズ
- Output (SHIFT-JIS encoding):
ピーター・ジョーンズ
(SHIFT-JIS to be encoded)
I've tried these code snippets for converting UTF-8 strings to SHIFT-JIS:
stringToEncode.getBytes(Charset.forName("SHIFT-JIS"))
new String(unecodedString.getBytes("SHIFT-JIS"), "UTF-8")
Both code snippets return this string output: �s�[�^�[�E�W���[���Y
(SHIFT-JIS encoded)
Any ideas on how this can be resolved?
Recommended answer
Internally in Java, Strings are implemented as an array of UTF-16 code units. But this is an implementation detail; it would be possible to implement a JVM that uses a different encoding internally.
(Note that "encoding", "charset" and Charset are more or less synonyms.)
A String should be treated as a sequence of Unicode codepoints (even though in Java it's a sequence of UTF-16 code units).
If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.
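To illustrate the point about code points versus the internal UTF-16 representation, here is a minimal sketch. The class and method names are hypothetical, and the string is written with Unicode escapes so that the example does not depend on the source file encoding. It compares the number of UTF-16 code units with the number of code points, which only differ for characters outside the Basic Multilingual Plane.

```java
class CodePointsVsCodeUnits {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The nine katakana characters used in the question, as escapes.
        String name = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30BA";
        // A single supplementary character (code point 1F600 hex),
        // which needs two UTF-16 code units.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println(codeUnits(name) + " units, " + codePoints(name) + " points");
        System.out.println(codeUnits(supplementary) + " units, "
                + codePoints(supplementary) + " points");
    }
}
```

Neither number says anything about UTF-8 or Shift-JIS; those only come into play once the String is turned into bytes.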
What you can have is a sequence of bytes that decode to a String if you decode it using an encoding, such as UTF-8 or Shift-JIS.
Or you can have a String that encodes to a sequence of bytes if you encode it using an encoding, such as UTF-8 or Shift-JIS.
In short, an encoding or Charset is a pair of two functions, "encode" and "decode" such that:
// String -> encode -> bytes
byte[] bytes = string.getBytes(encoding);
// or using Charset
ByteBuffer byteBuffer = charset.encode(string);
// bytes -> decode -> String
String string = new String(bytes, encoding);
// or using Charset
String string = charset.decode(byteBuffer).toString();
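The fragments above can be combined into a small runnable sketch. The class name CharsetRoundTrip and its helper methods are hypothetical; the only API calls are Charset.forName, Charset.encode, Charset.decode and String.getBytes(Charset), all of which exist in Java SE 6. It assumes the JVM ships the Shift_JIS charset, which standard JDK builds normally do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class CharsetRoundTrip {
    // Encodes the text with the named charset, then decodes the bytes again.
    static String roundTrip(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        ByteBuffer encoded = charset.encode(text);   // String -> bytes
        return charset.decode(encoded).toString();   // bytes -> String
    }

    // Number of bytes the text occupies in the named charset.
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // The string from the question, written as Unicode escapes.
        String text = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30BA";
        System.out.println(roundTrip(text, "UTF-8").equals(text));
        System.out.println(roundTrip(text, "Shift_JIS").equals(text));
        System.out.println(encodedLength(text, "UTF-8"));      // 3 bytes per character
        System.out.println(encodedLength(text, "Shift_JIS"));  // 2 bytes per character
    }
}
```

The same String yields different byte sequences depending on the charset, but decoding with the charset that produced the bytes always gives the original String back.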
So if you have a byte[] that's encoded using UTF-8:
byte[] utf8Bytes = "ピーター・ジョーズ".getBytes("UTF-8");
// utf8Bytes now contains, in hexadecimal
// e3 83 94 e3 83 bc e3 82 bf (ピ ー タ)
// e3 83 bc e3 83 bb e3 82 b8 (ー ・ ジ)
// e3 83 a7 e3 83 bc e3 82 ba (ョ ー ズ)
You can create a String from those bytes using:
String string = new String(utf8Bytes, "UTF-8");
// String now contains "ピーター・ジョーズ"
Then you can encode that String as Shift-JIS using:
byte[] shiftJisBytes = string.getBytes("Shift-JIS");
// shiftJisBytes now contains, in hexadecimal
// 83 73 81 5b 83 5e (ピ ー タ)
// 81 5b 81 45 83 57 (ー ・ ジ)
// 83 87 81 5b 83 59 (ョ ー ズ)
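Putting the two steps together, a conversion from UTF-8 bytes to Shift-JIS bytes without touching a file might look like the sketch below. The class Utf8ToShiftJis and its methods transcode and toHex are hypothetical helpers, not part of any library; the charset names are passed through Charset.forName so no checked exception is involved.

```java
import java.nio.charset.Charset;

class Utf8ToShiftJis {
    // Decodes the input bytes with one charset and re-encodes them with another.
    static byte[] transcode(byte[] input, String fromCharset, String toCharset) {
        String decoded = new String(input, Charset.forName(fromCharset));
        return decoded.getBytes(Charset.forName(toCharset));
    }

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The string from the answer, written as Unicode escapes.
        String text = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30BA";
        byte[] utf8Bytes = text.getBytes(Charset.forName("UTF-8"));
        byte[] shiftJisBytes = transcode(utf8Bytes, "UTF-8", "Shift_JIS");
        System.out.println(toHex(utf8Bytes));
        System.out.println(toHex(shiftJisBytes));
    }
}
```

The result is a byte[]; it should be handed to whatever consumes Shift-JIS data as bytes, not wrapped back into a String with a different charset.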
Since those bytes represent a string encoded using Shift-JIS, trying to decode them using UTF-8 will produce garbage:
String garbage = new String(shiftJisBytes, "UTF-8");
// garbage now contains "�s�[�^�[�E�W���[�Y"
// � is the character decoded when given an invalid UTF-8 sequence
// 83 73 81 5b 83 5e (� s � [ � ^)
// 81 5b 81 45 83 57 (� [ � E � W)
// 83 87 81 5b 83 59 (� � � [ � Y)
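The String constructor silently substitutes the replacement character, which makes this kind of mistake easy to miss. One way to catch it early, sketched here with hypothetical names (StrictDecode, decodeStrict, isValid), is to use a CharsetDecoder configured to report malformed input instead of replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class StrictDecode {
    // Decodes the bytes, throwing instead of inserting replacement characters.
    static String decodeStrict(byte[] bytes, String charsetName)
            throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // True if the bytes form a valid sequence in the named charset.
    static boolean isValid(byte[] bytes, String charsetName) {
        try {
            decodeStrict(bytes, charsetName);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30BA";
        byte[] shiftJisBytes = text.getBytes(Charset.forName("Shift_JIS"));
        System.out.println(isValid(shiftJisBytes, "UTF-8"));
        System.out.println(isValid(shiftJisBytes, "Shift_JIS"));
    }
}
```

A successful strict decode does not prove the charset guess is right, since some byte sequences are valid in several encodings, but a failure is a reliable sign that the wrong charset was used.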
Further, remember that if you print a String to an output such as System.out, the String is converted to bytes using the platform default encoding, which is system dependent. It looks like your system default is UTF-8.
System.out.print(string);
// equivalent to:
System.out.write(string.getBytes(Charset.defaultCharset()));
Then, if your output is for example the Windows console, the console will convert those bytes back to characters before presenting them to you, very probably using a completely different encoding (probably CP437 or CP850).
This last part might be tripping you up.
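If the goal is to emit Shift-JIS bytes to a stream regardless of the platform default, one option is to wrap the stream in a PrintStream with an explicit encoding. The sketch below uses hypothetical names (ExplicitOutputEncoding, printWithEncoding) and writes to an in-memory buffer so the resulting bytes can be inspected; whether a given console then renders them correctly still depends on that console's own settings.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitOutputEncoding {
    // Prints the text through a PrintStream that encodes with the named charset
    // and returns the bytes that were actually written.
    static byte[] printWithEncoding(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30BA";
        System.out.println(printWithEncoding(text, "Shift_JIS").length + " bytes written");

        // The same idea applied to standard output: bytes leave the JVM as
        // Shift-JIS instead of the platform default encoding.
        PrintStream sjisOut = new PrintStream(System.out, true, "Shift_JIS");
        sjisOut.println(text);
    }
}
```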