Converting String from One Charset to Another


Question

I am working on converting a string from one charset to another. After reading many examples, I found the code below, which looks good to me. As a newcomer to charset encoding, I would like to know whether it is the right way to do it.

public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
    return new String(source, from).getBytes(to);
} 

To convert a String from ASCII to EBCDIC, I have to do:

System.out.println(new String(transcodeField(ebytes,
                Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));

To convert from EBCDIC to ASCII, I have to do:

System.out.println(new String(transcodeField(ebytes,
                Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));

Recommended answer

The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies two conditions:

  1. Your input data is bytes
  2. Your output data must be bytes

In that case, it's straightforward:

byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));

If the input data contains characters that can't be represented in the output encoding (for example, when converting non-ASCII UTF-8 text to US-ASCII), those characters are replaced with the ? replacement symbol, and the data is corrupted.
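If silent replacement with ? is not acceptable, one option is to make the conversion fail instead. The following is a minimal sketch, assuming the standard java.nio.charset encoder and decoder APIs; the class name StrictTranscoder and the helpers transcodeStrict and isLossy are hypothetical names introduced for illustration and are not part of the original code.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictTranscoder {

    // Hypothetical variant of transcodeField that throws instead of writing '?'.
    public static byte[] transcodeStrict(byte[] source, Charset from, Charset to)
            throws CharacterCodingException {
        // Decode bytes -> chars, rejecting input that is not valid in 'from'.
        CharBuffer chars = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(source));
        // Encode chars -> bytes, rejecting characters that 'to' cannot represent.
        ByteBuffer encoded = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }

    // Returns true if the conversion cannot be done without losing data.
    public static boolean isLossy(byte[] source, Charset from, Charset to) {
        try {
            transcodeStrict(source, from, to);
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // "café", written with an escape so the source file encoding does not matter.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // ISO-8859-1 can represent é, US-ASCII cannot.
        System.out.println(isLossy(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)); // false
        System.out.println(isLossy(utf8, StandardCharsets.UTF_8, StandardCharsets.US_ASCII));   // true
    }
}
```

Whether to throw or to replace depends on the use case; the point of the sketch is only that the loss can be detected rather than discovered later in corrupted output.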

However, a lot of people ask "How do I convert a String from one encoding to another?", to which a lot of people answer with the following snippet:

String s = new String(source.getBytes(inputEncoding), outputEncoding);

This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (where possible; again, unsupported characters are converted to ?). The String constructor with the second parameter creates a new String from a byte array, treating the bytes as being in the specified encoding. Since you just used source.getBytes(inputEncoding) to get those bytes, they are not encoded in outputEncoding (unless the two encodings happen to use the same byte values, which is common for "normal" characters like abcd, but not for more complex ones such as the accented characters éêäöñ).

So what does this mean? It means that when you have a Java String, everything is great. Strings are Unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, because then you need to decide on an encoding. Choosing a Unicode-compatible encoding such as UTF-8 or UTF-16 is great: your characters will still be safe even if your String contains all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive), your String must contain only the characters supported by that encoding, or the result will be corrupted bytes.

Now, finally, some examples of good and bad usage.

String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8");  // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)

String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8");  // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Now contains "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"

The last example demonstrates that even though both encodings support the Nordic characters, they use different bytes to represent them. Therefore there is no such thing as converting a String from one encoding to another, and you should never use the broken example.
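The difference in byte representation can be checked directly. The following is a small sketch using only standard JDK calls; the class name EncodingBytesDemo and its helper methods are hypothetical names chosen for illustration. It prints the bytes of ä in both encodings and contrasts a matched round trip with a mismatched one.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingBytesDemo {

    // Encode and decode with the SAME charset: the String survives unchanged.
    public static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Encode as UTF-8 but decode as ISO-8859-1: the mismatch described above.
    public static String mismatched(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String a = "\u00e4"; // the letter ä, escaped so the source encoding does not matter

        // Same character, different bytes (Java bytes are signed).
        System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_8)));      // [-61, -92]
        System.out.println(Arrays.toString(a.getBytes(StandardCharsets.ISO_8859_1))); // [-28]

        String word = "H\u00e4r"; // "Här"
        System.out.println(roundTripUtf8(word).equals(word)); // true
        System.out.println(mismatched(word).equals(word));    // false
    }
}
```

In the mismatched case each of the two UTF-8 bytes of ä is decoded as a separate ISO-8859-1 character, which is why one character turns into two.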

Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
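One way to follow this advice is to pass a Charset constant from java.nio.charset.StandardCharsets rather than a charset name. Unlike getBytes(String), the getBytes(Charset) overload does not throw the checked UnsupportedEncodingException. The sketch below assumes UTF-8 is the desired encoding; the class name ExplicitCharsetDemo and the helpers toUtf8 and fromUtf8 are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Always name the charset when going from String to bytes...
    public static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // ...and name the same charset when going back from bytes to String.
    public static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The platform default varies between machines and JVM versions,
        // which is why relying on it is fragile.
        System.out.println("Platform default: " + Charset.defaultCharset());

        String text = "\u98a8\u6c34"; // the two characters from the feng shui example, escaped
        byte[] bytes = toUtf8(text);
        System.out.println(bytes.length);                 // 6 (3 bytes per character in UTF-8)
        System.out.println(fromUtf8(bytes).equals(text)); // true
    }
}
```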

As a final note, charset and encoding aren't the same thing, but they're very closely related.

¹ Technically, a String is stored internally in the JVM as UTF-16 up to Java 8, and in a variable encoding (Latin-1 or UTF-16, chosen per string) from Java 9 onwards, but the developer doesn't need to care about that.
