Converting a String from One Charset to Another
Question
I am working on converting a string from one charset to another. After reading many examples, I found the code below, which looks good to me. As a newcomer to charset encoding, I want to know whether it is the right way to do it.
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
return new String(source, from).getBytes(to);
}
To convert a String from ASCII to EBCDIC, I do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
To convert from EBCDIC to ASCII, I do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
Answer
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies two conditions:
- Your input data is bytes
- Your output data must be bytes
In that case, it's straightforward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (for example, converting UTF-8 text with non-ASCII characters to US-ASCII), those characters are replaced with the ? replacement symbol, and the data is corrupted.
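If silent replacement with ? is not acceptable, one option is to make the conversion fail instead. The following is a minimal sketch, not part of the original answer: transcodeStrict and the class name StrictTranscode are hypothetical names, and the sketch uses the standard java.nio.charset CharsetDecoder and CharsetEncoder with CodingErrorAction.REPORT so that malformed or unmappable data raises a CharacterCodingException.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictTranscode {

    // Hypothetical variant of transcodeField: throws instead of substituting '?'.
    static byte[] transcodeStrict(byte[] source, Charset from, Charset to)
            throws CharacterCodingException {
        // Decode the input bytes; fail on byte sequences that are invalid in 'from'.
        CharBuffer chars = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(source));
        // Encode the characters; fail on characters that 'to' cannot represent.
        ByteBuffer encoded = to.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(chars);
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }

    public static void main(String[] args) {
        // "Feng shui: " followed by two CJK characters, written as Unicode escapes.
        byte[] utf8 = "Feng shui: \u98A8\u6C34".getBytes(StandardCharsets.UTF_8);
        try {
            transcodeStrict(utf8, StandardCharsets.UTF_8, StandardCharsets.US_ASCII);
            System.out.println("Transcoded without loss");
        } catch (CharacterCodingException e) {
            System.out.println("Cannot transcode without loss: " + e);
        }
    }
}
```

Whether failing or substituting is the better behaviour depends on the use case; the point is only that the loss does not have to be silent.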
However, a lot of people ask "How do I convert a String from one encoding to another?", and a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible; again, characters that can't be encoded are converted to ?). The String constructor with the second parameter creates a new String from a byte array, where the bytes are assumed to be in the specified encoding. Since you just used source.getBytes(inputEncoding) to get those bytes, they are not encoded in outputEncoding (unless the two encodings happen to use the same byte values, which is common for "normal" characters like abcd, but not for accented characters like éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are Unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a Unicode-compatible encoding such as UTF-8 or UTF-16 is great. It means your characters will still be safe even if your String contains all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive), your String must contain only the characters supported by that encoding, or the result will be corrupted bytes.
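Before choosing a limited encoding, it is possible to check whether a String fits in it. The sketch below is an illustration under assumptions rather than part of the original answer: fitsIn and EncodableCheck are hypothetical names, and the check relies on the standard CharsetEncoder.canEncode(CharSequence) method.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodableCheck {

    // Hypothetical helper: true if every character of 'text' can be encoded in 'charset'.
    static boolean fitsIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // The two CJK characters from the Feng shui example, as Unicode escapes.
        String cjk = "\u98A8\u6C34";
        System.out.println(fitsIn("abcd", StandardCharsets.US_ASCII)); // true
        System.out.println(fitsIn(cjk, StandardCharsets.US_ASCII));    // false
        System.out.println(fitsIn(cjk, StandardCharsets.UTF_8));       // true
    }
}
```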
Now, finally, some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Now contains "HÃ¤r Ã¤r nÃ¥gra merkkejÃ¤"
The last example demonstrates that even though both encodings support the Nordic characters, they use different bytes to represent them. Therefore there's no such thing as converting a String from one encoding to another, and you should never use the broken example.
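To make the difference concrete, here is a small self-contained sketch that contrasts the broken snippet with a byte-level conversion. It is an illustration, not the original author's code: the class MojibakeDemo and its methods misdecode and utf8ToLatin1 are hypothetical names, and the sample string is the Nordic text from the examples written with Unicode escapes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MojibakeDemo {

    // The broken pattern: encode as UTF-8, then decode the same bytes as ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // The byte-level conversion: decode with the real input encoding, then re-encode.
    static byte[] utf8ToLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The Nordic sample string; \u00E4 is a-umlaut, \u00E5 is a-ring.
        String nordic = "H\u00E4r \u00E4r n\u00E5gra merkkej\u00E4";

        byte[] utf8 = nordic.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = nordic.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length);   // 25: each accented character takes 2 bytes
        System.out.println(latin1.length); // 21: each accented character takes 1 byte

        // Each accented character turns into two wrong characters.
        String broken = misdecode(nordic);
        System.out.println(broken.length()); // 25 instead of 21

        // Converting the bytes properly yields the same bytes as encoding directly.
        System.out.println(Arrays.equals(utf8ToLatin1(utf8), latin1)); // true
    }
}
```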
Also note that you should always specify the encoding explicitly (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
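As a short sketch of that advice, the constants in java.nio.charset.StandardCharsets (available since Java 7) can be passed to both calls; unlike the String-name overloads used in the examples, the Charset overloads do not throw UnsupportedEncodingException. The class name ExplicitCharsetDemo and the encode and decode helpers are hypothetical, and the choice of UTF-8 is only an assumption for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Hypothetical helpers that always name the encoding instead of using the default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The platform default varies between machines and JVM versions.
        System.out.println("Default charset here: " + Charset.defaultCharset());

        // With the encoding named on both sides, the round trip is lossless everywhere.
        String original = "H\u00E4r \u00E4r n\u00E5gra merkkej\u00E4";
        System.out.println(decode(encode(original)).equals(original)); // true
    }
}
```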
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically, a String is stored internally in the JVM as UTF-16 up to Java 8, and in a variable encoding (Latin-1 or UTF-16, depending on content) from Java 9 onwards, but the developer doesn't need to care about that.