Java UTF-8 encoding produces incorrect output
Question
In Java, I've been trying to write a String to a file using UTF-8 encoding, so that it can later be read by another program written in a different programming language. While doing so, I noticed that the bytes created when encoding a String into a byte array didn't seem to have the correct values.
I narrowed the problem down to the symbol "£", which seems to produce incorrect bytes when encoded to UTF-8:
byte[] byteArray = "£".getBytes(Charset.forName("UTF-8"));
// Print out the Byte Array of the UTF-8 converted string
// Upcast byte values to print the bytes as unsigned
for (byte signedByte : byteArray) {
    System.out.print((signedByte & 0xFF) + " ");
}
This outputs 6 bytes with the decimal values 239 190 130 239 189 163, which in hex is: ef be 82 ef bd a3
However, http://www.utf8-chartable.de/ says that the UTF-8 bytes for "£" in hex are c2 a3, so the output should be: 194 163
Other strings seem to produce correct bytes when encoded as UTF-8, so I'm wondering why Java is producing these 6 bytes for "£", and how I should go about properly converting my Strings to byte arrays using UTF-8 encoding.
I also tried:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8");
out.write("£");
out.close();
but this produced the same 6 bytes.
Recommended answer
I suspect the problem is that you're using a string literal in your Java code, written by an editor that saves the file in one encoding, but then you're compiling without specifying that same encoding. In other words, I suspect that your "£" string is not actually a single pound sign at all.
This should be easy to validate. For example:
char[] chars = "£".toCharArray();
for (char c : chars) {
    System.out.println((int) c);
}
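As a rough illustration of how such a mismatch could arise, the sketch below encodes the pound sign as UTF-8, decodes those bytes with a different charset (as a compiler using the wrong source encoding would), and re-encodes the result as UTF-8. The choice of Shift_JIS is an assumption, picked only because it happens to reproduce the 6 bytes from the question; the question does not say which encoding the compiler actually used. The class and method names are hypothetical, and the Shift_JIS charset is assumed to be available in the running JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Format a byte array as space-separated, two-digit unsigned hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulate a source file saved as UTF-8 but read with another charset,
    // then encode the misread string as UTF-8 again.
    static String misreadAndReencode(String original, Charset compilerCharset) {
        byte[] onDisk = original.getBytes(StandardCharsets.UTF_8);
        String asSeenByCompiler = new String(onDisk, compilerCharset);
        return hex(asSeenByCompiler.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS stands in for the compiler's default encoding.
        Charset assumed = Charset.forName("Shift_JIS");
        System.out.println(hex("\u00a3".getBytes(StandardCharsets.UTF_8))); // c2 a3
        System.out.println(misreadAndReencode("\u00a3", assumed)); // ef be 82 ef bd a3
    }
}
```

If the source file really is saved as UTF-8, passing the matching encoding to the compiler (javac -encoding UTF-8) should make the literal compile as a single character.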
To take this out of the equation, you can specify the string in a pure-ASCII form using a Unicode escape sequence:
String pound = "\u00a3";
// Now encode as before
I'm sure you'll then get the right bytes. For example:
import java.nio.charset.Charset;

class Test {
    public static void main(String[] args) throws Exception {
        String pound = "\u00a3";
        byte[] bytes = pound.getBytes(Charset.forName("UTF-8"));
        for (byte b : bytes) {
            System.out.println(b & 0xff); // 194, 163
        }
    }
}
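The same approach should carry over to the file-writing case from the question. The following is a minimal sketch, not a definitive implementation: it writes the escaped string through an OutputStreamWriter with an explicit UTF-8 charset and reads the raw bytes back to check them. The class name, method name, and the temporary file are all hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WritePound {
    // Write text to a temporary file as UTF-8 and return the bytes on disk.
    static byte[] writeAndReadBack(String text) {
        try {
            Path file = Files.createTempFile("pound", ".txt"); // hypothetical scratch file
            try {
                // Name the charset explicitly so the platform default is never used.
                try (Writer out = new OutputStreamWriter(
                        Files.newOutputStream(file), StandardCharsets.UTF_8)) {
                    out.write(text);
                }
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The Unicode escape keeps the source file pure ASCII.
        for (byte b : writeAndReadBack("\u00a3")) {
            System.out.print((b & 0xFF) + " "); // 194 163
        }
        System.out.println();
    }
}
```

Using the StandardCharsets.UTF_8 constant instead of the "UTF-8" string avoids a charset lookup by name, but either form should select the same encoding.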