How can I open files containing accents in Java?

Problem description

    (editing for clarification and adding some code)

    Hello, We have a requirement to parse data sent from users all over the world. Our Linux systems have a default locale of en_US.UTF-8. However, we often receive files with diacritical marks in their names such as "special_á_ã_è_characters.doc". Though the OS can deal with these files fine, and an strace shows the OS passing the correct file name to the Java program, Java munges the names and throws a "file not found" io exception trying to open them.

    This simple program can illustrate the issue:

    import java.io.*;
    import java.text.*;
    
    public class load_i18n
    {
      public static void main( String [] args ) {
        File actual = new File(".");
        for( File f : actual.listFiles()){
          System.out.println( f.getName() );
        }
      }
    }
    

    Running this program in a directory containing the file special_á_ã_è_characters.doc and the default US English locale gives:

    special_�_�_�_characters.doc

    Setting the language via export LANG=es_ES@UTF-8 prints out the filename correctly (but is an unacceptable solution since the entire system is now running in Spanish.) Explicitly setting the Locale inside the program like the following has no effect either. Below I've modified the program to a) attempt to open the file and b) print out the name in both ASCII and as a byte array when it fails to open the file:

    import java.io.*;
    import java.util.Locale;
    import java.text.*;
    
    public class load_i18n
    {
      public static void main( String [] args ) {
        // Stream to read file
        FileInputStream fin;
    
        Locale locale = new Locale("es", "ES");
        Locale.setDefault(locale);
        File actual = new File(".");
        System.out.println(Locale.getDefault());
        for( File f : actual.listFiles()){
          try {
            fin = new FileInputStream (f.getName());
          }
          catch (IOException e){
            System.err.println ("Can't open the file " + f.getName() + ".  Printing as byte array.");
            byte[] textArray = f.getName().getBytes();
            for(byte b: textArray){
              System.err.print(b + " ");
            }
            System.err.println();
            System.exit(-1);
          }
    
          System.out.println( f.getName() );
        }
      }
    }
    

    This produces the output

    es_ES
    load_i18n.class
    Can't open the file special_�_�_�_characters.doc.  Printing as byte array.
    115 112 101 99 105 97 108 95 -17 -65 -67 95 -17 -65 -67 95 -17 -65 -67 95 99 104 97 114 97 99 116 101 114 115 46 100 111 99
    

    This shows that the issue is NOT just an issue with console display as the same characters and their representations are output in byte or ASCII format. In fact, console display does work even when using LANG=en_US.UTF-8 for some utilities like bash's echo:

    [mjuric@arrhchadm30 tmp]$ echo $LANG
    en_US.UTF-8
    [mjuric@arrhchadm30 tmp]$ echo *
    load_i18n.class special_á_ã_è_characters.doc
    [mjuric@arrhchadm30 tmp]$ ls
    load_i18n.class  special_?_?_?_characters.doc
    [mjuric@arrhchadm30 tmp]$
    

    Is it possible to modify this code in such a way that when run under Linux with LANG=en_US.UTF-8, it reads the file name in such a way that it can be successfully opened?

    Solution

    First, the character encoding used is not directly related to the locale. So changing the locale won't help much.

    Second, the � is the Unicode replacement character U+FFFD. When its UTF-8 bytes (0xEF 0xBF 0xBD, the -17 -65 -67 in your byte dump) are printed as ISO-8859-1 instead of UTF-8, they typically show up as ï¿½. Here's the evidence:

    System.out.println(new String("�".getBytes("UTF-8"), "ISO-8859-1")); // ï¿½
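
The same check can be run against the exact bytes from the question's output. This is only a sketch: the class name `ReplacementCharDemo` and its helper methods are made up for illustration, and the byte values are copied from the dump above.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // The three bytes printed for each accented letter in the question's dump
    // (-17 -65 -67 is 0xEF 0xBF 0xBD).
    private static final byte[] BYTES = {-17, -65, -67};

    // Decoded as UTF-8, the three bytes form the single character U+FFFD.
    public static String asUtf8() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    // Decoded as ISO-8859-1, every byte becomes its own character.
    public static String asLatin1() {
        return new String(BYTES, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Print code points in hex so the result does not depend on the console encoding.
        System.out.println("UTF-8 decode:      U+"
                + Integer.toHexString(asUtf8().codePointAt(0)).toUpperCase());
        StringBuilder latin1 = new StringBuilder();
        for (char c : asLatin1().toCharArray()) {
            latin1.append("U+00").append(Integer.toHexString(c).toUpperCase()).append(' ');
        }
        System.out.println("ISO-8859-1 decode: " + latin1.toString().trim());
    }
}
```

If the first line reports U+FFFD, the accented letters had already been replaced inside the JVM before anything reached the console.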
    

    So there are two problems:

    1. Your JVM is reading those special characters as the replacement character � (U+FFFD).
    2. Your console is using ISO-8859-1 to display characters.

    For a Sun JVM, the VM argument -Dfile.encoding=UTF-8 should fix the first problem. The second problem is to be fixed in the console settings. If you're using for example Eclipse, you can change it in Window > Preferences > General > Workspace > Text File Encoding. Set it to UTF-8 as well.
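
Before and after adding `-Dfile.encoding=UTF-8`, it may help to print what the JVM itself believes its encodings are. The sketch below is a diagnostic aid only; the class name `EncodingReport` is hypothetical. It also prints `sun.jnu.encoding`, an internal property that Sun/OpenJDK-derived JVMs are commonly reported to use for file names. Treat that as an assumption: the property is undocumented and may be `null` on other JVMs.

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // Collects the environment and JVM settings that usually influence
    // how text and file names are decoded.
    public static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("LANG=").append(System.getenv("LANG")).append('\n');
        sb.append("LC_ALL=").append(System.getenv("LC_ALL")).append('\n');
        sb.append("file.encoding=").append(System.getProperty("file.encoding")).append('\n');
        // Assumption: internal Sun/OpenJDK property, may be absent elsewhere.
        sb.append("sun.jnu.encoding=").append(System.getProperty("sun.jnu.encoding")).append('\n');
        sb.append("defaultCharset=").append(Charset.defaultCharset().name()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Running it once as `java EncodingReport` and once as `java -Dfile.encoding=UTF-8 EncodingReport`, from the same shell that launches the real program, shows whether the flag and the `LANG` setting actually reach the JVM.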


    Update: As per your update:

    byte[] textArray = f.getName().getBytes();
    

    That should have been the following to exclude influence of platform default encoding:

    byte[] textArray = f.getName().getBytes("UTF-8");
    

    If that still displays the same bytes, then the problem lies deeper. Which JVM exactly are you using? Run java -version to find out. As said before, the -Dfile.encoding argument is Sun JVM specific. Some Linux machines ship with the GNU JVM or OpenJDK's JVM, and this argument may not work there.
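
The byte-level check from the update can be sketched as a small standalone program. The class name `NameBytes` and its helpers are hypothetical. It uses `StandardCharsets.UTF_8` (Java 7 and later) in place of the string `"UTF-8"` so that no checked exception needs handling, and it opens each file through the `File` object itself.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class NameBytes {
    // Hex dump of the name's UTF-8 bytes, independent of the platform default charset.
    public static String utf8Hex(String name) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    // True if the name already contains the replacement character U+FFFD.
    public static boolean wasMangled(String name) {
        return name.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        File[] files = new File(".").listFiles();
        if (files == null) {
            return; // not a directory, or an I/O error occurred
        }
        for (File f : files) {
            if (!f.isFile()) {
                continue;
            }
            String name = f.getName();
            System.out.println(utf8Hex(name) + (wasMangled(name) ? "  <- contains U+FFFD" : ""));
            try (FileInputStream in = new FileInputStream(f)) {
                in.available(); // touch the stream to confirm the open succeeded
            } catch (IOException e) {
                System.err.println("Cannot open: " + e.getMessage());
            }
        }
    }
}
```

A name flagged as containing U+FFFD has already lost its original bytes inside the JVM, so no later re-encoding of the string can recover a path that the operating system will accept. In that case the fix has to happen where the JVM decodes file names, not in the application code.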
