Java - character encoding confusion on Windows

Problem description

I have a simple Java program that takes in hex and converts it to ASCII. Using JDK 8, I compiled the following:

        import java.nio.charset.Charset;
        import java.util.Scanner;

        // The class name did not survive in the source; Main is a placeholder.
        public class Main {
            public static void main(String[] args) {
                System.out.println("Charset: " + Charset.defaultCharset());
                Scanner in = new Scanner(System.in);
                System.out.print("Type a HEX string: ");
                String s = in.nextLine();
                String asciiStr = new String();
                // Split the string into an array
                String[] hexes = s.split(":");
                // For each hex
                for (String hex : hexes) {
                    // Translate the hex to ASCII
                    System.out.print(" " + Integer.parseInt(hex, 16) + "|" + (char) Integer.parseInt(hex, 16));
                    asciiStr += ((char) Integer.parseInt(hex, 16));
                }
                System.out.println("\nthe ASCII string is " + asciiStr);
                in.close();
            }
        }

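I'm passing a hex string of C0:A8:96:FE to the program. My main concern is the 0x96 value, because it is defined as a control character (characters in the range of 128 - 159).

The output when I run the program without any JVM flags is the following: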

Charset: windows-1252
Type a HEX string: C0:A8:96:FE
 192|À 168|¨ 150|? 254|þ
the ASCII string is À¨?þ

The output when I use the JVM flag -Dfile.encoding=ISO-8859-1 to set the character encoding is the following:

Charset: ISO-8859-1
Type a HEX string: C0:A8:96:FE
 192|À 168|¨ 150|– 254|þ
the ASCII string is À¨–þ

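I'm wondering why, when the character encoding is set to ISO-8859-1, I get the extra Windows-1252 characters for characters 128-159. These characters shouldn't be defined in ISO-8859-1 but should be defined in Windows-1252, yet it appears to be backwards here. In ISO-8859-1, I would think that the 0x96 character should be encoded as a blank character, but that is not the case. Instead, the Windows-1252 encoding does this, when it should properly encode it as –. Any help here?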

Solution

tl;dr

My guess: While the default Charset of your JVM may be "windows-1252", your System.out is actually using Unicode.

You said:

when I use the JVM flag -Dfile.encoding=ISO-8859-1 to set the character encoding

My experiments below lead me to suspect that whatever you were doing did not actually affect the character set used by System.out. I believe that in both your runs, when you thought your System.out was using "windows-1252" or "ISO-8859-1", your System.out was in fact using Unicode, likely UTF-8.

I wish I knew how to get the Charset of System.out.
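As a rough starting point for that question, the JVM exposes a few encoding-related system properties that can be printed. This is only a sketch: the class and method names are made up for illustration, `native.encoding` and `stdout.encoding` only exist on newer JDKs (17 and 19 respectively, to my knowledge), and `sun.stdout.encoding` is an internal, unsupported property that older JDKs set only when a console is attached. A missing value is reported as "(not set)".

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StdoutEncodingProbe {

    // Property names to inspect. Which ones are populated depends on the
    // JDK version and on whether the process is attached to a console.
    static final String[] PROPERTY_NAMES = {
            "file.encoding",       // default charset of the JVM
            "native.encoding",     // platform encoding (newer JDKs)
            "stdout.encoding",     // encoding used for System.out (newer JDKs)
            "sun.stdout.encoding"  // internal and unsupported (older JDKs)
    };

    // Collects each property value, or "(not set)" when it is absent.
    static Map<String, String> encodingProperties() {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : PROPERTY_NAMES) {
            result.put(name, System.getProperty(name, "(not set)"));
        }
        return result;
    }

    public static void main(String[] args) {
        encodingProperties().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Running this once without flags and once with -Dfile.encoding=ISO-8859-1 should show whether the flag changes anything besides `file.encoding` on a given JDK.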

Details

Actually, you are asking about Unicode rather than ASCII. ASCII has only 128 characters.

You said:

My main concern is the 0x96 value, because it is defined as a control character (characters in the range of 128 - 159).

Actually, that range of control characters starts at 127 in Unicode (and ASCII), not 128. Code point 127 is the DELETE character, so 127-159 are control characters.
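The range can be checked with `Character.isISOControl`, which the JDK documents as covering U+0000 to U+001F and U+007F to U+009F. A minimal sketch (the class and helper names are only illustrative):

```java
public class ControlRange {

    // Counts the code points in [from, to] that Java classifies as ISO control characters.
    static int countIsoControls(int from, int to) {
        int count = 0;
        for (int codePoint = from; codePoint <= to; codePoint++) {
            if (Character.isISOControl(codePoint)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("0x96 is a control character: " + Character.isISOControl(0x96));
        System.out.println("Controls in 127-159: " + countIsoControls(127, 159));
        System.out.println("126 is a control character: " + Character.isISOControl(126));
        System.out.println("160 is a control character: " + Character.isISOControl(160));
    }
}
```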

First, let’s split your input string of hex codes.

        final List < String > hexInputs = List.of( "C0:A8:96:FE".split( ":" ) );
        System.out.println( "hexInputs = " + hexInputs );

When run.

hexInputs = [C0, A8, 96, FE]

Now convert each hex text into an integer. We use that integer as a Unicode code point.
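That conversion step can be sketched on its own as follows. The helper names are hypothetical. One detail worth noting: `Character.toString(int)`, used in the code further down, was added in Java 11, so this sketch uses `StringBuilder.appendCodePoint` instead, which also works on the JDK 8 mentioned in the question.

```java
import java.util.ArrayList;
import java.util.List;

public class HexToCodePoints {

    // Parses colon-separated hex text such as "C0:A8:96:FE" into integer code points.
    static List<Integer> parseCodePoints(String colonSeparatedHex) {
        List<Integer> codePoints = new ArrayList<>();
        for (String hex : colonSeparatedHex.split(":")) {
            codePoints.add(Integer.parseInt(hex, 16));
        }
        return codePoints;
    }

    // Builds a String from code points without relying on Java 11 APIs.
    static String toText(List<Integer> codePoints) {
        StringBuilder text = new StringBuilder();
        for (int codePoint : codePoints) {
            text.appendCodePoint(codePoint);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<Integer> codePoints = parseCodePoints("C0:A8:96:FE");
        System.out.println("codePoints = " + codePoints);
        System.out.println("length of resulting text = " + toText(codePoints).length());
    }
}
```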

Rather than rely on some default character encoding, let's explicitly set the Charset of our System.out. I'm no expert on this, but some web-searching found the code below where we wrap System.out in a new PrintStream while setting a Charset by its name. I could not find a way to get the Charset of a PrintStream, so I asked.

UTF-8

        // UTF-8
        System.out.println( "----------|  UTF-8  |--------------------------" );
        try
        {
            PrintStream printStream = new PrintStream( System.out , true , StandardCharsets.UTF_8.name() ); // "UTF-8".

            for ( String hexInput : hexInputs )
            {
                int codePoint = Integer.parseInt( hexInput , 16 );
                String string = Character.toString( codePoint );
                printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
            }
        }
        catch ( UnsupportedEncodingException e )
        {
            e.printStackTrace();
        }

When run.

----------|  UTF-8  |--------------------------
hexInput: C0 = codePoint: 192 = string: [À] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
hexInput: A8 = codePoint: 168 = string: [¨] = isLetter: false = name: DIAERESIS
hexInput: 96 = codePoint: 150 = string: [] = isLetter: false = name: START OF GUARDED AREA
hexInput: FE = codePoint: 254 = string: [þ] = isLetter: true = name: LATIN SMALL LETTER THORN
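What appears on screen depends on both the bytes the PrintStream writes and how the terminal decodes them. To look at the first half in isolation, the same three-argument PrintStream constructor can wrap a `ByteArrayOutputStream` instead of System.out, so the bytes can be dumped as hex. This is a sketch with hypothetical names, and it assumes "windows-1252" is available on the JVM running it.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintStreamBytes {

    // Returns the bytes a PrintStream with the named charset writes for the text,
    // formatted as space-separated upper-case hex.
    static String bytesWritten(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not supported: " + charsetName, e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String controlCharacter = "\u0096"; // code point 150 from the question
        for (String name : new String[] { "UTF-8", "windows-1252", "ISO-8859-1" }) {
            System.out.println(name + " -> " + bytesWritten(controlCharacter, name));
        }
    }
}
```

For code point 150 this gives C2 96 under UTF-8, 3F (an ASCII question mark, because the character is unmappable) under windows-1252, and 96 under ISO-8859-1.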

Windows-1252

Next, we do the same but for setting "windows-1252" as the Charset of our wrapped System.out. Before doing the wrapping, we verify that such a character encoding is actually available on our current JVM.

        // windows-1252
        System.out.println( "----------|  windows-1252  |--------------------------" );

        // Verify windows-1252 charset is available on the current JVM.
        String windows1252CharSetName = "windows-1252";
        boolean isWindows1252CharsetAvailable = Charset.availableCharsets().keySet().contains( windows1252CharSetName );
        if ( isWindows1252CharsetAvailable )
        {
            System.out.println( "isWindows1252CharsetAvailable = " + isWindows1252CharsetAvailable );
        } else
        {
            System.out.println( "FAIL - No charset available for name: " + windows1252CharSetName );
        }

        try
        {
            PrintStream printStream = new PrintStream( System.out , true , windows1252CharSetName );

            for ( String hexInput : hexInputs )
            {
                int codePoint = Integer.parseInt( hexInput , 16 );
                String string = Character.toString( codePoint );
                printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
            }
        }
        catch ( UnsupportedEncodingException e )
        {
            e.printStackTrace();
        }

When run.

----------|  windows-1252  |--------------------------
isWindows1252CharsetAvailable = true
hexInput: C0 = codePoint: 192 = string: [�] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
hexInput: A8 = codePoint: 168 = string: [�] = isLetter: false = name: DIAERESIS
hexInput: 96 = codePoint: 150 = string: [?] = isLetter: false = name: START OF GUARDED AREA
hexInput: FE = codePoint: 254 = string: [�] = isLetter: true = name: LATIN SMALL LETTER THORN
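As an aside, the availability check above could also be written with `Charset.isSupported`. One difference, as far as I understand it: `Charset.availableCharsets()` is keyed by canonical names only, while `isSupported` and `forName` also accept aliases. A minimal sketch, with a hypothetical helper name:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class CharsetLookup {

    // True when the JVM can supply a charset for this name or alias.
    // A syntactically illegal name is treated as "not available" instead of throwing.
    static boolean isAvailable(String charsetName) {
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] { "UTF-8", "ISO-8859-1", "windows-1252", "no-such-charset" }) {
            System.out.println(name + " available: " + isAvailable(name));
        }
    }
}
```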

Latin-1

And we can try Latin-1 as well, producing yet another result.

        // ISO-8859-1
        System.out.println( "----------|  Latin-1  |--------------------------" );

        // Verify that the ISO-8859-1 charset is available on the current JVM.
        String latin1CharsetName = "ISO-8859-1"; // Also known as "Latin-1".
        boolean isLatin1CharsetNameAvailable = Charset.availableCharsets().keySet().contains( latin1CharsetName );
        if ( isLatin1CharsetNameAvailable )
        {
            System.out.println( "isLatin1CharsetNameAvailable = " + isLatin1CharsetNameAvailable );
        } else
        {
            System.out.println( "FAIL - No charset available for name: " + latin1CharsetName );
        }

        try
        {
            PrintStream printStream = new PrintStream( System.out , true , latin1CharsetName );

            for ( String hexInput : hexInputs )
            {
                int codePoint = Integer.parseInt( hexInput , 16 );
                String string = Character.toString( codePoint );
                printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
            }
        }
        catch ( UnsupportedEncodingException e )
        {
            e.printStackTrace();
        }

When run.

----------|  Latin-1  |--------------------------
isLatin1CharsetNameAvailable = true
hexInput: C0 = codePoint: 192 = string: [�] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
hexInput: A8 = codePoint: 168 = string: [�] = isLetter: false = name: DIAERESIS
hexInput: 96 = codePoint: 150 = string: [�] = isLetter: false = name: START OF GUARDED AREA
hexInput: FE = codePoint: 254 = string: [�] = isLetter: true = name: LATIN SMALL LETTER THORN

Conclusion

So you can see that when hard-coding the Charset of our wrapped System.out, we do indeed see a difference. With UTF-8, we get actual characters [À], [¨], [], [þ] whereas with windows-1252 we get three funky question mark characters and one regular question mark, [�], [�], [?], [�]. Remember that we added the square brackets in our code.

This behavior of my code matches my expectations, and apparently yours as well. Two of those four code points, À and þ, are letters in Unicode. Both also exist in windows-1252 and Latin-1 at the same byte values (0xC0 and 0xFE), so the funky question marks in the last two runs more likely come from the terminal failing to decode single-byte output as UTF-8 than from those charsets lacking the letters. The part that puzzles me is code point 150 (hex 96), which shows up three different ways: as an empty space with UTF-8, as a regular question mark with windows-1252, and as a funky question mark with Latin-1.
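A complementary check goes in the other direction: what a lone byte turns into when something decodes it with a given charset. This is a sketch with hypothetical names. Which charset a particular console actually uses for decoding is an assumption the code cannot verify, and "windows-1252" is assumed to be available on the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeSingleByte {

    // Decodes a single byte value (0-255) with the given charset.
    static String decode(int byteValue, Charset charset) {
        return new String(new byte[] { (byte) byteValue }, charset);
    }

    // Formats the first code point of the text as U+XXXX.
    static String firstCodePoint(String text) {
        return String.format("U+%04X", text.codePointAt(0));
    }

    public static void main(String[] args) {
        Charset[] charsets = {
                Charset.forName("windows-1252"),
                StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8
        };
        for (Charset charset : charsets) {
            System.out.println("byte 0x96 decoded as " + charset.name() + " -> "
                    + firstCodePoint(decode(0x96, charset)));
        }
    }
}
```

Byte 0x96 decodes to U+2013 (the en dash) under windows-1252, to the control character U+0096 under ISO-8859-1, and to the replacement character U+FFFD under UTF-8, since a lone 0x96 is not valid UTF-8. That at least fits the en dash in the question's second run and the funky question marks here, though it does not prove which decoder a given terminal applies.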

Conclusion: Your System.out is not using the Charset that you think it is using. I suspect that while your JVM's default Charset may be named "windows-1252", your System.out is actually using the Unicode character set, likely with UTF-8 encoding.


Note to the reader: If unfamiliar with character sets and character encoding, I recommend the fun and easy-reading post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
