String encoding conversion UTF-8 to SHIFT-JIS


Problem description

Environment:

  • JavaSE-6
  • No frameworks

Given this string input of ピーター・ジョーズ which is encoded in UTF-8, I am having problems converting the said string to Shift-JIS without the need of writing the said data to a file.


  • Input (UTF-8 encoding): ピーター・ジョーンズ
  • Output (SHIFT-JIS encoding): ピーター・ジョーンズ (SHIFT-JIS to be encoded)

I've tried these code snippets to convert UTF-8 strings to SHIFT-JIS:


  • stringToEncode.getBytes(Charset.forName("SHIFT-JIS"))
  • new String(unecodedString.getBytes("SHIFT-JIS"), "UTF-8")

Both code snippets return this string output: �s�[�^�[�E�W���[���Y (SHIFT-JIS encoded)

Any ideas on how this can be resolved?

Recommended answer

Internally in Java, Strings are implemented as an array of UTF-16 code units. But this is an implementation detail; it would be possible to implement a JVM that uses a different encoding internally.

(Note "encoding", "charset" and Charset are more or less synonyms.)

A String should be treated as a sequence of Unicode codepoints (even though in Java it's a sequence of UTF-16 code units).

If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.

What you can have is a sequence of bytes that decode to a String if you decode it using an encoding, such as UTF-8 or Shift-JIS.

Or you can have a String that encodes to a sequence of bytes if you encode it using an encoding, such as UTF-8 or Shift-JIS.

In short, an encoding or Charset is a pair of two functions, "encode" and "decode" such that:

// String -> encode -> bytes
byte[] bytes = string.getBytes(encoding);
// or using Charset
ByteBuffer byteBuffer = charset.encode(string);

// bytes -> decode -> String
String decoded = new String(bytes, encoding);
// or using Charset
String decodedViaCharset = charset.decode(byteBuffer).toString();

So if you have a byte[] that's encoded using UTF-8:

byte[] utf8Bytes = "ピーター・ジョーズ".getBytes("UTF-8");
// utf8Bytes now contains, in hexadecimal
// e3 83 94  e3 83 bc  e3 82 bf   (ピ ー タ)
// e3 83 bc  e3 83 bb  e3 82 b8   (ー ・ ジ)
// e3 83 a7  e3 83 bc  e3 82 ba   (ョ ー ズ)

You can create a String from those bytes using:

String string = new String(utf8Bytes, "UTF-8");
// String now contains "ピーター・ジョーズ"

Then you can encode that String as Shift-JIS using:

byte[] shiftJisBytes = string.getBytes("Shift-JIS");
// shiftJisBytes now contains, in hexadecimal
// 83 73  81 5b  83 5e   (ピ ー タ)
// 81 5b  81 45  83 57   (ー ・ ジ)
// 83 87  81 5b  83 59   (ョ ー ズ)
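
Putting the decode and encode steps together, the whole conversion can be sketched as a single helper. This is a minimal sketch, not part of the original answer: the class name SjisConversionSketch and the helpers utf8ToShiftJis and toHex are hypothetical names chosen for illustration. It assumes the JVM ships the Shift_JIS charset, and it writes the sample text with Unicode escapes so the result does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;

public class SjisConversionSketch {
    // Charset.forName is used instead of StandardCharsets to stay compatible with Java SE 6.
    private static final Charset UTF_8 = Charset.forName("UTF-8");
    // Assumption: the Shift_JIS charset is available in this JVM.
    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Decodes UTF-8 bytes to a String, then encodes that String as Shift-JIS bytes. */
    public static byte[] utf8ToShiftJis(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, UTF_8);
        return text.getBytes(SHIFT_JIS);
    }

    /** Formats bytes as space-separated lowercase hex so they can be inspected safely. */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample name from the answer, written with Unicode escapes.
        String name = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";
        byte[] sjis = utf8ToShiftJis(name.getBytes(UTF_8));
        // prints "83 73 81 5b 83 5e 81 5b 81 45 83 57 83 87 81 5b 83 59"
        System.out.println(toHex(sjis));
    }
}
```

Printing the bytes as hex, rather than as text, sidesteps any further encoding applied by the console.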

Since those bytes represent a string encoded using Shift-JIS, trying to decode using UTF-8 will produce garbage:

String garbage = new String(shiftJisBytes, "UTF-8");
// garbage now contains "�s�[�^�[�E�W���[�Y"
// � is the character decoded when given an invalid UTF-8 sequence
// 83 73 81 5b 83 5e   (� s � [ � ^)
// 81 5b 81 45 83 57   (� [ � E � W)
// 83 87 81 5b 83 59   (� � � [ � Y)
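
The difference between decoding with the matching charset and with the wrong one can be checked directly. The following is a small sketch under the same assumptions as the answer's example (the Shift_JIS charset is available in the JVM); the class name WrongCharsetSketch and the helper decode are hypothetical and exist only for this illustration.

```java
import java.nio.charset.Charset;

public class WrongCharsetSketch {
    private static final Charset UTF_8 = Charset.forName("UTF-8");
    // Assumption: the Shift_JIS charset is available in this JVM.
    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Decodes the given bytes using the given charset. */
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The sample name from the answer, written with Unicode escapes.
        String original = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";
        byte[] sjis = original.getBytes(SHIFT_JIS);

        // Decoding with the charset that produced the bytes restores the text.
        String roundTrip = decode(sjis, SHIFT_JIS);
        // Decoding with a different charset yields replacement characters (U+FFFD).
        String garbage = decode(sjis, UTF_8);

        System.out.println(original.equals(roundTrip));     // prints "true"
        System.out.println(garbage.indexOf('\ufffd') >= 0); // prints "true"
    }
}
```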

Further, remember that printing a String to an output such as System.out converts the String to bytes using the platform default encoding, which varies between systems. It looks like your system default is UTF-8.

System.out.print(string);
// equivalent to:
System.out.write(string.getBytes(Charset.defaultCharset()));

Then, if your output is for example the Windows console, the console converts those bytes back to characters using what is very probably a completely different encoding (likely CP437 or CP850) before presenting them to you.

This last part might be tripping you up.
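
One way to take the platform default out of the picture is to give the PrintStream an explicit encoding. The sketch below is illustrative only: the class name ExplicitEncodingOutputSketch and the helper printWithEncoding are hypothetical, and a ByteArrayOutputStream stands in for the real output so the bytes can be counted. It assumes the Shift_JIS charset is available in the JVM.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitEncodingOutputSketch {
    /**
     * Prints the text through a PrintStream that uses the named encoding
     * and returns the bytes that reached the underlying stream.
     */
    public static byte[] printWithEncoding(String text, String encodingName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            // This PrintStream constructor takes the encoding name explicitly,
            // so the platform default charset is not consulted.
            PrintStream out = new PrintStream(sink, true, encodingName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encodingName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // The sample name from the answer, written with Unicode escapes.
        String name = "\u30d4\u30fc\u30bf\u30fc\u30fb\u30b8\u30e7\u30fc\u30ba";
        // Nine characters: two bytes each in Shift-JIS, three bytes each in UTF-8.
        System.out.println(printWithEncoding(name, "Shift_JIS").length); // prints "18"
        System.out.println(printWithEncoding(name, "UTF-8").length);     // prints "27"
    }
}
```

The same text produces different byte sequences depending on the stream's encoding, which is why the terminal or file reader on the other end has to decode with that same encoding.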
