File.listFiles() mangles unicode names with JDK 6 (Unicode Normalization issues)

Problem description

I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system.

Note that it is not merely the display of these file names that is causing me problems. I'm mainly interested in doing a comparison of file names with a remote file storage system, so I care more about the content of the name strings than the character encoding used to print output.

Here is a program to demonstrate. It creates a file with a Unicode name then prints out URL-encoded versions of the file names obtained from the directly-created File, and the same file when listed under a parent directory (you should run this code in an empty directory). The results show the different encoding returned by the File.listFiles() method.

String fileName = "Trîcky Nåme";
File file = new File(fileName);
file.createNewFile();
System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));

// Get parent (current) dir and list file contents
File parentDir = file.getAbsoluteFile().getParentFile();
File[] children = parentDir.listFiles();
for (File child: children) {
    System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
}

Here's what I get when I run this test code on my systems. Note the %CC versus %C3 character representations.

OS X Snow Leopard:

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)

KUbuntu Linux (running in a VM on same OS X system):

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)

I have tried various hacks to get the strings to agree, including setting the file.encoding system property and various LC_CTYPE and LANG environment variables. Nothing helps, nor do I want to resort to such hacks.

Unlike this (somewhat related?) question, I am able to read data from the listed files despite the odd names.

Recommended answer

Using Unicode, there is more than one valid way to represent the same letter. The characters you're using in your Tricky Name are a "latin small letter i with circumflex" and a "latin small letter a with ring above".

You say "Note the %CC versus %C3 character representations", but looking closer what you see are the sequences

i 0xCC 0x82 vs. 0xC3 0xAE
a 0xCC 0x8A vs. 0xC3 0xA5

That is, the first is the letter i followed by 0xCC82, which is the UTF-8 encoding of the Unicode \u0302 "combining circumflex accent" character, while the second is UTF-8 for \u00EE "latin small letter i with circumflex". Similarly for the other pair: the first is the letter a followed by 0xCC8A, the "combining ring above" character, and the second is "latin small letter a with ring above". Both of these are valid UTF-8 encodings of valid Unicode character strings, but one is in "composed" and the other in "decomposed" format.
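To see the two forms side by side without touching the file system, here is a small sketch that URL-encodes the same letter in both forms, the way the question's test program does. The class name NormalizationForms and the helper encode are made up for illustration; java.text.Normalizer and java.net.URLEncoder are standard classes available in Java 6.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

// Hypothetical demo class; the names here are not from the original post.
class NormalizationForms {

    // URL-encode a string as UTF-8, as the question's test program does.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Composed form: one code point, LATIN SMALL LETTER I WITH CIRCUMFLEX.
        String composed = "\u00EE";
        // Decomposed form: 'i' followed by COMBINING CIRCUMFLEX ACCENT (\u0302).
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(encode(composed));   // prints %C3%AE
        System.out.println(encode(decomposed)); // prints i%CC%82
    }
}
```

Both strings render as the same letter, but they differ in length (1 versus 2 chars) and are not equal under String.equals().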

OS X HFS Plus volumes store strings (e.g. filenames) as "fully decomposed". On a Unix file system, names are stored however the filesystem driver chooses to store them. You can't make any blanket statements across different types of filesystems.

See the Wikipedia article on Unicode Equivalence for general discussion of composed vs decomposed forms, which mentions OS X specifically.

See Apple's Tech Q&A QA1235 (in Objective-C unfortunately) for information on converting forms.

A recent email thread on Apple's java-dev mailing list could be of some help to you.

Basically, you need to normalize the decomposed form into a composed form before you can compare the strings.
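In Java 6 this can be done with java.text.Normalizer. A minimal sketch follows, assuming you want to compare names in NFC (composed) form; the class FileNameNormalizer and its methods toNfc and sameName are hypothetical helpers, not part of any library.

```java
import java.text.Normalizer;

// Hypothetical helper class for illustration only.
class FileNameNormalizer {

    // Convert a name to NFC (canonical composition).
    static String toNfc(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    // Compare two names after bringing both to the same normalization form.
    static boolean sameName(String a, String b) {
        return toNfc(a).equals(toNfc(b));
    }

    public static void main(String[] args) {
        String composed = "Tr\u00EEcky N\u00E5me";     // precomposed letters
        String decomposed = "Tri\u0302cky Na\u030Ame"; // base letters + combining marks

        System.out.println(composed.equals(decomposed));    // prints false
        System.out.println(sameName(composed, decomposed)); // prints true
    }
}
```

In the question's program you would pass child.getName() and the remote name through the same normalization before comparing. Normalizing both sides, rather than only the listed name, avoids having to know which form each source happens to return.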
