Java - character encoding confusion on Windows
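Problem description

I have a simple Java program that takes in hex and converts it to ASCII. Using JDK 8, I compiled the following:

import java.nio.charset.Charset;
import java.util.Scanner;

public class Main
{
    public static void main(String[] args)
    {
        System.out.println("Charset: " + Charset.defaultCharset());
        Scanner in = new Scanner(System.in);
        System.out.print("Type a HEX string: ");
        String s = in.nextLine();
        String asciiStr = new String();
        // Split the string into an array
        String[] hexes = s.split(":");
        // For each hex
        for (String hex : hexes) {
            // Translate the hex to ASCII
            System.out.print(" " + Integer.parseInt(hex, 16) + "|" + (char)Integer.parseInt(hex, 16));
            asciiStr += ((char) Integer.parseInt(hex, 16));
        }
        System.out.println("\nthe ASCII string is " + asciiStr);
        in.close();
    }
}

I am passing in a hex string of C0:A8:96:FE to the program. My main concern is the 0x96 value, because it is defined as a control character (characters in the range of 128 - 159).

The output when I run the program without any JVM flags is the following:

Charset: windows-1252
Type a HEX string: C0:A8:96:FE
192|À 168|¨ 150|? 254|þ
the ASCII string is À¨?þ

The output when I use the JVM flag -Dfile.encoding=ISO-8859-1 to set the character encoding appears to be the following:

Charset: ISO-8859-1
Type a HEX string: C0:A8:96:FE
192|À 168|¨ 150|– 254|þ
the ASCII string is À¨–þ

I'm wondering why, when the character encoding is set to ISO-8859-1, I get the extra Windows-1252 characters for characters 128 - 159. These characters shouldn't be defined in ISO-8859-1, but should be defined in Windows-1252, yet it appears to be backwards here. In ISO-8859-1, I would think that the 0x96 character is supposed to be encoded as a blank character, but that is not the case. Instead, the Windows-1252 encoding does this, when it should properly encode it as a –. Any help here?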
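As a point of reference for the question above, the following minimal sketch decodes the single byte 0x96 under both charsets and reports the resulting Unicode code point. The class name DecodeByte0x96 and the helper describe are hypothetical names chosen for illustration; the sketch uses only standard java.nio.charset and java.lang.Character calls and assumes the running JVM ships the windows-1252 charset.

```java
import java.nio.charset.Charset;

class DecodeByte0x96 {
    // Decodes one byte with the named charset and describes the resulting code point.
    static String describe(int byteValue, String charsetName) {
        byte[] bytes = { (byte) byteValue };
        String decoded = new String(bytes, Charset.forName(charsetName));
        int codePoint = decoded.codePointAt(0);
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        System.out.println("windows-1252: " + describe(0x96, "windows-1252"));
        System.out.println("ISO-8859-1:   " + describe(0x96, "ISO-8859-1"));
    }
}
```

Under these assumptions the windows-1252 line should report U+2013 EN DASH and the ISO-8859-1 line should report the control character U+0096 START OF GUARDED AREA, which agrees with the premise that the printable dash is defined only in windows-1252.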
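tl;dr

My guess: While the default Charset of your JVM may be "windows-1252", your System.out is actually using Unicode.

You said: "when I use the JVM flag -Dfile.encoding=ISO-8859-1 to set the character encoding"

My experiments below lead me to suspect that whatever you were doing did not actually affect the character set used by System.out. I believe that in both your runs, when you thought your System.out was using "windows-1252" or "ISO-8859-1", your System.out was in fact using Unicode, likely UTF-8.

I wish I knew how to get the Charset of System.out.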
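One way to gather hints about which encodings are in play is to print the JVM's default Charset next to the encoding-related system properties. The sketch below is only a diagnostic aid, not a definitive way to read the Charset of System.out. The class name EncodingReport and the method collect are hypothetical, and the property names "sun.jnu.encoding", "sun.stdout.encoding" and "stdout.encoding" are assumptions to verify against your JDK: which of them are set depends on the JDK version and on whether output goes to a console, so any of them may come back as null.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingReport {
    // Collects encoding-related settings; a property the running JVM does not define maps to null.
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        String[] keys = { "file.encoding", "sun.jnu.encoding", "sun.stdout.encoding", "stdout.encoding" };
        for (String key : keys) {
            report.put(key, System.getProperty(key));
        }
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running it once without flags and once with -Dfile.encoding=ISO-8859-1 shows which values the flag changes on a given machine.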
Details

Actually, you are asking about Unicode rather than ASCII. ASCII has only 128 characters.

You said: "My main concern is the 0x96 value, because it is defined as a control character (characters in the range of 128 - 159)."

One nuance: code point 127 (DELETE) is also a control character, in both ASCII and Unicode, so the contiguous run of control characters is 127-159. The range 128-159 that you cite is the C1 control block.

First, let's split your input string of hex codes.

// List.of requires Java 9 or later.
final List < String > hexInputs = List.of( "C0:A8:96:FE".split( ":" ) );
System.out.println( "hexInputs = " + hexInputs );

When run:

hexInputs = [C0, A8, 96, FE]

Now convert each hex text into an integer. We use that integer as a Unicode code point.

Rather than rely on some default character encoding, let's explicitly set the Charset of our System.out. I'm no expert on this, but some web-searching found the code below where we wrap System.out in a new PrintStream while setting a Charset by its name. I could not find a way to get the Charset of a PrintStream, so I asked.

UTF-8

// UTF-8
System.out.println( "----------|  UTF-8  |--------------------------" );
try
{
    PrintStream printStream = new PrintStream( System.out , true , StandardCharsets.UTF_8.name() ); // "UTF-8".
    for ( String hexInput : hexInputs )
    {
        int codePoint = Integer.parseInt( hexInput , 16 );
        String string = Character.toString( codePoint ); // Character.toString( int ) requires Java 11 or later.
        printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
    }
}
catch ( UnsupportedEncodingException e )
{
    e.printStackTrace();
}

When run:

----------|  UTF-8  |--------------------------
hexInput: C0 = codePoint: 192 = string: [À] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
hexInput: A8 = codePoint: 168 = string: [¨] = isLetter: false = name: DIAERESIS
hexInput: 96 = codePoint: 150 = string: [] = isLetter: false = name: START OF GUARDED AREA
hexInput: FE = codePoint: 254 = string: [þ] = isLetter: true = name: LATIN SMALL LETTER THORN

Windows-1252

Next, we do the same but set "windows-1252" as the Charset of our wrapped System.out. Before doing the wrapping, we verify that such a character encoding is actually available on our current JVM.

// windows-1252
System.out.println( "----------|  windows-1252  |--------------------------" );
// Verify windows-1252 charset is available on the current JVM.
String windows1252CharSetName = "windows-1252";
boolean isWindows1252CharsetAvailable = Charset.availableCharsets().keySet().contains( windows1252CharSetName );
if ( isWindows1252CharsetAvailable )
{
    System.out.println( "isWindows1252CharsetAvailable = " + isWindows1252CharsetAvailable );
} else
{
    System.out.println( "FAIL - No charset available for name: " + windows1252CharSetName );
}
try
{
    PrintStream printStream = new PrintStream( System.out , true , windows1252CharSetName );
    for ( String hexInput : hexInputs )
    {
        int codePoint = Integer.parseInt( hexInput , 16 );
        String string = Character.toString( codePoint );
        printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
    }
}
catch ( UnsupportedEncodingException e )
{
    e.printStackTrace();
}

When run:

----------|  windows-1252  |--------------------------
isWindows1252CharsetAvailable = true
hexInput: C0 = codePoint: 192 = string: [�] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
hexInput: A8 = codePoint: 168 = string: [�] = isLetter: false = name: DIAERESIS
hexInput: 96 = codePoint: 150 = string: [?] = isLetter: false = name: START OF GUARDED AREA
hexInput: FE = codePoint: 254 = string: [�] = isLetter: true = name: LATIN SMALL LETTER THORN

Latin-1

And we can try Latin-1 as well, producing yet a different result.

// ISO-8859-1
System.out.println( "----------|  Latin-1  |--------------------------" );
// Verify that charset is available on the current JVM.
String latin1CharsetName = "ISO-8859-1"; // Also known as "Latin-1".
boolean isLatin1CharsetNameAvailable = Charset.availableCharsets().keySet().contains( latin1CharsetName );
if ( isLatin1CharsetNameAvailable )
{
    System.out.println( "isLatin1CharsetNameAvailable = " + isLatin1CharsetNameAvailable );
} else
{
    System.out.println( "FAIL - No charset available for name: " + latin1CharsetName );
}
try
{
    PrintStream printStream = new PrintStream( System.out , true , latin1CharsetName );
    for ( String hexInput : hexInputs )
    {
        int codePoint = Integer.parseInt( hexInput , 16 );
        String string = Character.toString( codePoint );
        printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
    }
}
catch ( UnsupportedEncodingException e )
{
    e.printStackTrace();
}

When run:

----------|  Latin-1  |--------------------------
isLatin1CharsetNameAvailable = true
hexInput: C0 = codePoint: 192 = string: [�] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
hexInput: A8 = codePoint: 168 = string: [�] = isLetter: false = name: DIAERESIS
hexInput: 96 = codePoint: 150 = string: [�] = isLetter: false = name: START OF GUARDED AREA
hexInput: FE = codePoint: 254 = string: [�] = isLetter: true = name: LATIN SMALL LETTER THORN

Conclusion

So you can see that when hard-coding the Charset of our wrapped System.out, we do indeed see a difference. With UTF-8, we get actual characters [À], [¨], [], [þ] whereas with windows-1252 we get three funky question mark characters and one regular question mark, [�], [�], [?], [�]. Remember that we added the square brackets in our code.

This behavior of my code matches my expectations, and apparently meets yours as well. Two of those four code points (À and þ) are letters, and Character.isLetter reports them as such in every run, because the check is made on the Unicode code point and does not depend on the output Charset. The only mysterious thing to me is that hex 96 (decimal 150) has three different representations: an empty space with UTF-8, a regular question mark with windows-1252, and a funky question mark under Latin-1.

Conclusion: Your System.out is not using the Charset that you think it is using. I suspect that while the default Charset of your JVM may be named "windows-1252", your System.out is actually using the Unicode character set, likely with UTF-8 encoding.

Note to the reader: If unfamiliar with character sets and character encoding, I recommend the fun and easy-reading post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
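The glyphs shown in the experiments depend on how the console decodes whatever bytes it receives, so it can help to look at the bytes themselves. The following minimal sketch repeats the PrintStream wrapping, but around a ByteArrayOutputStream instead of System.out, and prints the bytes produced for code point 0x96 under each Charset. The class name EncodedBytes and the method hexBytes are hypothetical names for illustration; only standard java.io and java.lang calls are used, and the sketch assumes all three charsets are available on the JVM.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class EncodedBytes {
    // Prints one code point through a PrintStream that uses the named charset
    // and returns the bytes that reached the underlying stream, as hex.
    static String hexBytes(int codePoint, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream printStream = new PrintStream(buffer, true, charsetName);
            printStream.print(new String(Character.toChars(codePoint)));
            printStream.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : buffer.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] { "UTF-8", "windows-1252", "ISO-8859-1" }) {
            System.out.println(name + ": " + hexBytes(0x96, name));
        }
    }
}
```

On a JVM where these charsets behave as assumed, this should print C2 96 for UTF-8, 3F for windows-1252, and 96 for ISO-8859-1. The 3F is an ASCII question mark, which suggests that the regular question mark seen under windows-1252 is the encoder's replacement for a character it cannot represent, while the other glyphs depend on how the console interprets the bytes it is given.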