How to Find the Default Charset/Encoding in Java?


Question

The obvious answer is to use Charset.defaultCharset(), but we recently found out that this might not be the right answer. I was told that on several occasions the result differs from the real default charset used by the java.io classes. It looks like Java keeps two separate default charsets. Does anyone have any insight into this issue?

We were able to reproduce one failure case. It's a kind of user error, but it may still expose the root cause of all the other problems. Here is the code:

import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharSetTest {

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        // Changing the property at runtime is not supported; done here only to reproduce the problem.
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }

    // Reports the encoding that java.io picks when no charset is passed in.
    private static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        return writer.getEncoding();
    }
}

Our server requires Latin-1 as the default charset to deal with some mixed encoding (ANSI/Latin-1/UTF-8) in a legacy protocol, so all our servers run with this JVM parameter:

-Dfile.encoding=ISO-8859-1

Here is the result on Java 5:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1

Someone tried to change the encoding at runtime by setting file.encoding in the code. We all know that doesn't work. However, it apparently throws off defaultCharset(), while leaving the real default charset used by OutputStreamWriter unaffected.

Is this a bug or a feature?

EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5: it does not return the default encoding used by the I/O classes. It looks like Java 6 corrects this issue.
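To check whether the two defaults agree on a given JVM, the comparison in the question can be turned into a small helper. This is only a sketch: the class and method names (DefaultCharsetCheck, ioDefault, defaultsAgree) are made up for illustration, and it assumes that the historical name returned by getEncoding() (for example ISO8859_1) is accepted by Charset.forName(), which holds for the standard charsets.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    /** Charset that a java.io writer picks when no charset is passed in. */
    static Charset ioDefault() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        // getEncoding() returns a historical name such as "ISO8859_1";
        // Charset.forName() maps that alias back to the canonical charset.
        return Charset.forName(writer.getEncoding());
    }

    /** True when Charset.defaultCharset() and the java.io default are the same charset. */
    static boolean defaultsAgree() {
        return Charset.defaultCharset().equals(ioDefault());
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset()=" + Charset.defaultCharset());
        System.out.println("java.io default=" + ioDefault());
        System.out.println("agree=" + defaultsAgree());
    }
}
```

Running this before and after any code that touches file.encoding gives a quick indication of whether the two values have drifted apart.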

Answer

This is really strange... Once set, the default Charset is cached and doesn't change while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every call to Charset.defaultCharset() returns the cached charset.

Here are my results:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1

I'm using JVM 1.6 though.

(update)

OK, I did reproduce your bug with JVM 1.5.

Looking at the 1.5 source code, the cached default charset is never set. I don't know whether this is a bug, but 1.6 changes the implementation and uses the cached charset:

JVM 1.5:

public static Charset defaultCharset() {
    synchronized (Charset.class) {
        if (defaultCharset == null) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String)AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                return cs;
            return forName("UTF-8");
        }
        return defaultCharset;
    }
}

JVM 1.6:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String)AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

When you set file.encoding=Latin-1 on JVM 1.5, the next call to Charset.defaultCharset() finds that the cached default charset isn't set, so it looks up a charset named Latin-1. That name isn't found, because it's incorrect, so the method returns the fallback, UTF-8.
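The difference between the two lookups can be modeled outside the JDK. The following is a toy sketch, not the JDK code: the property name demo.encoding and all class and method names are invented for the demo, and resolve() only approximates the "look up the name, else UTF-8" step quoted above.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCachingDemo {

    // Hypothetical property used only by this demo; the real JDK reads "file.encoding".
    static final String PROP = "demo.encoding";

    private static Charset cached; // filled in by cachedDefault() on first use

    /** Resolve a charset name, falling back to UTF-8 when the name is unknown or illegal. */
    static Charset resolve(String name) {
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name syntax: fall through to the fallback.
        }
        return Charset.forName("UTF-8");
    }

    /** Modeled on the 1.5 snippet: the property is re-read on every call, nothing is stored. */
    static Charset uncachedDefault() {
        return resolve(System.getProperty(PROP));
    }

    /** Modeled on the 1.6 snippet: the first result is stored and reused afterwards. */
    static synchronized Charset cachedDefault() {
        if (cached == null) {
            cached = resolve(System.getProperty(PROP));
        }
        return cached;
    }

    public static void main(String[] args) {
        System.setProperty(PROP, "ISO-8859-1");
        System.out.println("uncached=" + uncachedDefault() + " cached=" + cachedDefault());

        System.setProperty(PROP, "Latin-1"); // the same value the question sets at runtime
        System.out.println("uncached=" + uncachedDefault() + " cached=" + cachedDefault());
    }
}
```

After the second setProperty call the cached variant keeps reporting ISO-8859-1, while the uncached variant re-resolves the name and falls back to UTF-8 whenever the JVM does not recognize it, which is the pattern in the Java 5 output from the question.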

As for why the I/O classes such as OutputStreamWriter return an unexpected result: the implementation of sun.nio.cs.StreamEncoder (which these I/O classes use) also differs between JVM 1.5 and JVM 1.6. The JVM 1.6 implementation calls Charset.defaultCharset() to get the default encoding when none is passed to the I/O classes. The JVM 1.5 implementation instead calls Converters.getDefaultEncodingName(), which uses its own cache of the default charset, set at JVM initialization:

JVM 1.6:

   public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                     Object lock,
                                                     String charsetName)
       throws UnsupportedEncodingException
   {
       String csn = charsetName;
       if (csn == null)
           csn = Charset.defaultCharset().name();
       try {
           if (Charset.isSupported(csn))
               return new StreamEncoder(out, lock, Charset.forName(csn));
       } catch (IllegalCharsetNameException x) { }
       throw new UnsupportedEncodingException (csn);
   }

JVM 1.5:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                  Object lock,
                                                  String charsetName)
    throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Converters.getDefaultEncodingName();
    if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
        try {
            if (Charset.isSupported(csn))
                return new CharsetSE(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
    }
    return new ConverterSE(out, lock, csn);
}

But I agree with the comments: you shouldn't rely on this property. It's an implementation detail.
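One way to avoid depending on the default at all is to pass the charset explicitly wherever bytes and characters are converted. A minimal sketch, assuming the legacy protocol from the question wants ISO-8859-1; the class name ExplicitCharsetDemo and the encode helper are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class ExplicitCharsetDemo {

    // Charset assumed for the legacy protocol described in the question.
    static final Charset LATIN_1 = Charset.forName("ISO-8859-1");

    /** Encode text with an explicitly chosen charset, ignoring the JVM default. */
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(bytes, charset);
        try {
            writer.write(text);
        } finally {
            writer.close(); // flushes the encoder into the byte stream
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = encode("\u00e9", LATIN_1);                 // U+00E9, e with acute accent
        byte[] utf8 = encode("\u00e9", Charset.forName("UTF-8"));
        System.out.println("ISO-8859-1 bytes: " + latin1.length);  // prints 1
        System.out.println("UTF-8 bytes: " + utf8.length);         // prints 2
    }
}
```

With the charset passed in, the output is the same regardless of -Dfile.encoding or any later System.setProperty call, so the caching differences between JVM versions no longer matter for that code path.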
