How can I open files containing accents in Java?

Problem description

(Edited to clarify and add some code)

Hello, We have a requirement to parse data sent from users all over the world. Our Linux systems have a default locale of en_US.UTF-8. However, we often receive files with diacritical marks in their names such as "special_á_ã_è_characters.doc". Though the OS can deal with these files fine, and an strace shows the OS passing the correct file name to the Java program, Java munges the names and throws a "file not found" io exception trying to open them.

This simple program can illustrate the issue:

import java.io.*;
import java.text.*;

public class load_i18n
{
  public static void main( String [] args ) {
    File actual = new File(".");
    for( File f : actual.listFiles()){
      System.out.println( f.getName() );
    }
  }
}

Running this program in a directory containing the file special_á_ã_è_characters.doc and the default US English locale gives:

special_�_�_�_characters.doc

Setting the language via export LANG=es_ES@UTF-8 prints out the filename correctly (but is an unacceptable solution since the entire system is now running in Spanish.) Explicitly setting the Locale inside the program like the following has no effect either. Below I've modified the program to a) attempt to open the file and b) print out the name in both ASCII and as a byte array when it fails to open the file:

import java.io.*;
import java.util.Locale;
import java.text.*;

public class load_i18n
{
  public static void main( String [] args ) {
    // Stream to read file
    FileInputStream fin;

    Locale locale = new Locale("es", "ES");
    Locale.setDefault(locale);
    File actual = new File(".");
    System.out.println(Locale.getDefault());
    for( File f : actual.listFiles()){
      try {
        fin = new FileInputStream (f.getName());
      }
      catch (IOException e){
        System.err.println ("Can't open the file " + f.getName() + ".  Printing as byte array.");
        byte[] textArray = f.getName().getBytes();
        for(byte b: textArray){
          System.err.print(b + " ");
        }
        System.err.println();
        System.exit(-1);
      }

      System.out.println( f.getName() );
    }
  }
}

This produces the output:

es_ES
load_i18n.class
Can't open the file special_�_�_�_characters.doc.  Printing as byte array.
115 112 101 99 105 97 108 95 -17 -65 -67 95 -17 -65 -67 95 -17 -65 -67 95 99 104 97 114 97 99 116 101 114 115 46 100 111 99

This shows that the issue is NOT just an issue with console display as the same characters and their representations are output in byte or ASCII format. In fact, console display does work even when using LANG=en_US.UTF-8 for some utilities like bash's echo:

[mjuric@arrhchadm30 tmp]$ echo $LANG
en_US.UTF-8
[mjuric@arrhchadm30 tmp]$ echo *
load_i18n.class special_á_ã_è_characters.doc
[mjuric@arrhchadm30 tmp]$ ls
load_i18n.class  special_?_?_?_characters.doc
[mjuric@arrhchadm30 tmp]$

Is it possible to modify this code in such a way that when run under Linux with LANG=en_US.UTF-8, it reads the file name in such a way that it can be successfully opened?

Recommended answer

First, the character encoding used is not directly related to the locale. So changing the locale won't help much.

Second, ï¿½ is how the Unicode replacement character U+FFFD (�) typically appears when its UTF-8 bytes are printed as ISO-8859-1 instead of UTF-8. Here is some evidence:

System.out.println(new String("�".getBytes("UTF-8"), "ISO-8859-1")); // ï¿½
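
A self-contained version of that one-liner, as a minimal sketch: it encodes U+FFFD as UTF-8 and then decodes the same bytes as ISO-8859-1. The class and method names are made up for illustration. The byte values it prints are the same -17 -65 -67 triple that appears in the byte dump in the question, which suggests the file name already contained U+FFFD by the time Java handed it back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementCharDemo {
    // UTF-8 encoding of the replacement character U+FFFD, as signed Java bytes.
    static byte[] utf8BytesOfReplacementChar() {
        return "\uFFFD".getBytes(StandardCharsets.UTF_8);
    }

    // The same bytes decoded with the wrong charset (ISO-8859-1 maps each byte to one char).
    static String misdecodedAsLatin1() {
        return new String(utf8BytesOfReplacementChar(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Prints [-17, -65, -67], i.e. EF BF BD in hex.
        System.out.println(Arrays.toString(utf8BytesOfReplacementChar()));

        // Print code points rather than glyphs so the result does not depend on the console.
        for (char c : misdecodedAsLatin1().toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // U+00EF U+00BF U+00BD
        }
        System.out.println();
    }
}
```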

So there are two problems:

  1. Your JVM is reading those special characters as �.
  2. Your console is using ISO-8859-1 to display characters.

For a Sun JVM, the VM argument -Dfile.encoding=UTF-8 should fix the first problem. The second problem has to be fixed in the console settings. If you are using Eclipse, for example, you can change it in Window > Preferences > General > Workspace > Text File Encoding. Set it to UTF-8 as well.
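
Before changing anything, it may help to see what the JVM currently believes about encodings. The sketch below only reads settings and does not change them. The class name is hypothetical, and sun.jnu.encoding is an implementation-specific property that, as far as I know, exists on Sun/Oracle/OpenJDK builds but may be null elsewhere, so treat it as an assumption.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingReport {
    // Collect the encoding-related settings visible to this JVM (values may be null).
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("file.encoding", System.getProperty("file.encoding"));
        // Assumption: implementation-specific property, not guaranteed to exist.
        report.put("sun.jnu.encoding", System.getProperty("sun.jnu.encoding"));
        report.put("defaultCharset", Charset.defaultCharset().name());
        // Environment variables the JVM inherited from the shell.
        report.put("LANG", System.getenv("LANG"));
        report.put("LC_ALL", System.getenv("LC_ALL"));
        return report;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : collect().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Running it once as java EncodingReport and once as java -Dfile.encoding=UTF-8 EncodingReport shows whether the flag changes anything on the JVM in use.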

Update: regarding your update:

byte[] textArray = f.getName().getBytes();

That should have been the following to exclude influence of platform default encoding:

byte[] textArray = f.getName().getBytes("UTF-8");

If that still displays the same, then the problem lies deeper. Exactly which JVM are you using? Run java -version to check. As said before, the -Dfile.encoding argument is Sun JVM specific. Some Linux machines ship with the GNU JVM or OpenJDK's JVM, and this argument may then not work.
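
To take both the console and the platform default encoding out of the picture while investigating, the names can be dumped as hex with an explicit charset. This is only a diagnostic sketch with made-up class and method names, not a fix: if a name prints as ef bf bd where an accented letter should be, the replacement character was introduced when the JVM decoded the file name, before your code saw it.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

public class FileNameBytes {
    // Render a name as space-separated hex of its UTF-8 bytes, independent of the default charset.
    static String toHex(String name) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        File[] files = new File(".").listFiles();
        if (files == null) { // listFiles() returns null if the directory cannot be read
            System.err.println("Cannot list the current directory");
            return;
        }
        for (File f : files) {
            // The hex column is pure ASCII, so it survives any console encoding.
            System.out.println(toHex(f.getName()) + "  exists=" + f.exists());
        }
    }
}
```

A correctly decoded á shows up as c3 a1, which can be compared with the bytes the OS reports, for example in the strace output mentioned in the question.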
