Java Can't Open a File with Surrogate Unicode Values in the Filename?

Question

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is:

"草𦿶鷗外.gif", which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif

If I create a File from this filename, I can't open it: I get a FileNotFoundException. Even running this on the folder that contains the file fails:

File[] files = folder.listFiles(); 
for (File file : files) {
    if (!file.exists()) {
        System.out.println("Failed to find File"); //Fails on the surrogate filename
    }
}

Most of the code I am actually dealing with is of the form:

FileInputStream instream = new FileInputStream(new File("草𦿶鷗外.gif"));
// operations follow

Is there some way I can address this problem, either escaping the filenames or opening files differently?

Recommended answer

I suspect one of Java or Mac is using CESU-8 instead of proper UTF-8. Java uses "modified UTF-8" (which is a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither Mac nor Java here to test with.

"Modified" is a modified way of saying "badly bugged". Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:

\xF0\xA6\xBF\xB6

it outputs a UTF-8-encoded sequence for each of the surrogates:

\xED\xA1\x9B\xED\xBF\xB6

This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. The problem is that if you round-trip it through a real UTF-8 encoder you end up with a different byte string: the four-byte one above. Try to access the file with that name and boom! fail.
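To make the difference concrete, here is a minimal Java sketch that produces both byte sequences for 𦿶 and then performs the round trip described above. It is an illustration, not a fix: the class and method names are made up for this example, and it assumes Java 7 or later for `StandardCharsets` (on Java 1.5, `getBytes("UTF-8")` is the equivalent call). It relies on `DataOutputStream.writeUTF` and `DataInputStream.readUTF`, which are documented to use modified UTF-8 with a two-byte length prefix.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
public class SurrogateEncodingDemo {

    // Standard UTF-8: a surrogate pair becomes one four-byte sequence.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by writeUTF, with the 2-byte length prefix removed.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Decode modified UTF-8 by putting the length prefix back and calling readUTF.
    static String decodeModifiedUtf8(byte[] body) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeShort(body.length);
            out.write(body);
            out.flush();
            return new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())).readUTF();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Render bytes in the same \xNN style used in this answer.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD85B\uDFF6"; // the supplementary character from the filename
        byte[] six = modifiedUtf8(s);
        System.out.println(hex(utf8(s)));  // \xF0\xA6\xBF\xB6
        System.out.println(hex(six));      // \xED\xA1\x9B\xED\xBF\xB6
        // Round trip: lenient decode, then strict re-encode, yields other bytes.
        System.out.println(hex(utf8(decodeModifiedUtf8(six)))); // \xF0\xA6\xBF\xB6
    }
}
```

The last line is the failure mode in miniature: a name stored as six bytes comes back as four after a decode and re-encode, so a byte-for-byte lookup no longer matches.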

So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames such as Python 2.x:

$ python
Python 2.x.something (blah blah)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')

On my filesystem (Linux, ext4, UTF-8), the filename "草𦿶鷗外.gif" comes out as:

['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:

['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

it's probably OS X doing it wrong. Does it always store filenames like this? (Or did the files come from somewhere else originally?) What happens if you rename the file to the "proper" version:

os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
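It may also help to see the same directory from Java's side. The sketch below is a hypothetical helper, not part of the original code: it prints each name returned by `File.listFiles()` as escaped UTF-16 code units, alongside the result of `exists()`. If the surrogate pair shows up as D85B followed by DFF6 and `exists()` is still false, the name is being damaged on the way back to the filesystem rather than on the way in.

```java
import java.io.File;

// Hypothetical diagnostic class; the names are illustrative only.
public class FilenameInspector {

    // Show every non-ASCII UTF-16 code unit as a backslash-u escape with four hex digits.
    static String escapeUtf16(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Directory to inspect; defaults to the current directory.
        File folder = new File(args.length > 0 ? args[0] : ".");
        File[] files = folder.listFiles();
        if (files == null) {
            // listFiles() returns null if the path is not a readable directory.
            System.out.println("Not a readable directory: " + folder);
            return;
        }
        for (File file : files) {
            System.out.println(escapeUtf16(file.getName())
                    + " exists=" + file.exists());
        }
    }
}
```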
