Java, unzip folder with German characters in filenames


Question


    I'm trying to unzip a folder whose file names contain German characters, for example Aufhänge. I know that Java 7 uses UTF-8 by default, and I think "ä" is a character that UTF-8 can encode. Here is my code snippet:

    public static void main(String[] args) throws IOException {
        ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), StandardCharsets.UTF_8);
        ZipEntry zipEntry;
        while ((zipEntry = zipInputStream.getNextEntry()) != null) {
            System.out.println(zipEntry.getName());
        }
    }
    

    This is the error I get: java.lang.IllegalArgumentException: MALFORMED

    It works with Charset.forName("Cp437"), but it doesn't work with StandardCharsets.UTF_8.

    Solution

    You don't mention your operating system, nor how you created the zip file, but I managed to recreate your problem anyway, using 7-Zip on Windows 10:

    • Create a simple text file with some trivial content (e.g. nothing but the three characters "abc").
    • Save the file as D:\Temp\Aufhänge.txt. Note the umlaut in the file name.
    • Locate that file in Windows File Explorer.
    • Select the file and right click. From the context menu select 7-Zip > Add to "Aufhänge.zip" to create Aufhänge.zip.

    Then, in NetBeans, run the following code to read the entries of the zip file you just created:

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    
    public class GermanZip {
    
        static String ZIP_PATH = "D:\\Temp\\Aufhänge.zip";
    
        public static void main(String[] args) throws FileNotFoundException, IOException {
    
            ZipInputStream zipInputStream = new ZipInputStream(new FileInputStream(ZIP_PATH), Charset.forName("UTF-8"));
            ZipEntry zipEntry;
            while ((zipEntry = zipInputStream.getNextEntry()) != null) {
                System.out.println(zipEntry.getName());
            }
        }
    
    }
    

    As you pointed out, the code throws java.lang.IllegalArgumentException: MALFORMED when evaluating this expression: (zipEntry = zipInputStream.getNextEntry()) != null.

    The problem arises because by default 7-Zip encodes the names of the files within the zip file using Cp437, as noted in this comment from 7-Zip:

    Default encoding is OEM (DOS) encoding. It's for compatibility with old zip software.

    That's why the unzip works when using Charset.forName("Cp437") instead of Charset.forName("UTF-8").
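The mismatch can be seen at the byte level. The following is a minimal sketch, not part of the original answer: the class name UmlautBytes and its helper methods are hypothetical, and it assumes the Cp437 charset is available in your Java runtime (it ships with standard JDK builds). It encodes the name in both charsets and then tries to decode the Cp437 bytes strictly as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UmlautBytes {

    // Decodes strictly: returns null instead of substituting replacement
    // characters when the bytes are not valid in the given charset.
    static String strictDecode(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("Cp437");
        String name = "Aufh\u00e4nge"; // "Aufhänge", escaped to avoid source-encoding issues

        byte[] oemBytes = name.getBytes(cp437);
        byte[] utf8Bytes = name.getBytes(StandardCharsets.UTF_8);

        System.out.println("Cp437 bytes: " + hex(oemBytes));   // ä is the single byte 84
        System.out.println("UTF-8 bytes: " + hex(utf8Bytes));  // ä is the two bytes C3 A4

        // The Cp437 bytes decode fine as Cp437 but are malformed as UTF-8.
        System.out.println("Cp437 bytes as Cp437: " + strictDecode(oemBytes, cp437));
        System.out.println("Cp437 bytes as UTF-8: " + strictDecode(oemBytes, StandardCharsets.UTF_8));
    }
}
```

The single byte 0x84 that Cp437 uses for "ä" is a UTF-8 continuation byte with no lead byte in front of it, which is why a strict UTF-8 decoder rejects the name.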

    If you want to unzip using Charset.forName("UTF-8"), then you have to force 7-Zip to encode the filenames within the zip in UTF-8. To do this, specify the cu parameter when running 7-Zip, as noted in the linked comment:

    • In Windows File Explorer select the file and right click.
    • From the context menu select 7-Zip > "Add to Archive...".
    • In the Add to Archive dialog, specify cu in the Parameters field.

    • Having stored the zipped filenames in UTF-8 format, you can then replace Charset.forName("Cp437") with Charset.forName("UTF-8") in your code, and no exception will be thrown when unzipping.

    This answer is specific to Windows 10 and 7-Zip, but the general principle should apply in any environment: if you specify UTF-8 as the encoding for your ZipInputStream, be certain that the filenames within the zip file really are encoded using UTF-8. You can easily verify this by opening the zip file in a binary editor and searching for the names of the zipped files.
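The same principle can be reproduced without 7-Zip. The sketch below is an illustration under stated assumptions, not the original answer's code: ZipNameRoundTrip, createZip and firstEntryName are hypothetical names, the archive is built in memory with java.util.zip.ZipOutputStream, and the Cp437 charset is assumed to be available. Depending on the JDK version, the decoding failure may surface as IllegalArgumentException or as a ZipException, so the sketch catches both.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameRoundTrip {

    // Builds a one-entry zip in memory, encoding the entry name with nameCharset.
    static byte[] createZip(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer, nameCharset)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write("abc".getBytes(StandardCharsets.US_ASCII));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Returns the first entry's name, or null if the name cannot be decoded
    // with nameCharset (or the archive has no entries).
    static String firstEntryName(byte[] zip, Charset nameCharset) {
        try (ZipInputStream zis =
                     new ZipInputStream(new ByteArrayInputStream(zip), nameCharset)) {
            ZipEntry entry = zis.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IllegalArgumentException | ZipException e) {
            return null; // name bytes are not valid in nameCharset
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("Cp437");
        String name = "Aufh\u00e4nge.txt"; // "Aufhänge.txt"

        byte[] oemZip = createZip(name, cp437);
        byte[] utf8Zip = createZip(name, StandardCharsets.UTF_8);

        System.out.println("Cp437 zip read as Cp437: " + firstEntryName(oemZip, cp437));
        System.out.println("Cp437 zip read as UTF-8: " + firstEntryName(oemZip, StandardCharsets.UTF_8));
        System.out.println("UTF-8 zip read as UTF-8: " + firstEntryName(utf8Zip, StandardCharsets.UTF_8));
    }
}
```

Reading succeeds whenever the charset passed to ZipInputStream matches the one used to write the names, and the Cp437 archive read as UTF-8 yields null here because the name bytes are malformed.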


    Update based on OP's comment/question below:

    • Unfortunately the .ZIP File Format Specification does not currently provide a way to store the encoding used for zipped file names apart from one exception, as described in "APPENDIX D - Language Encoding (EFS)":

      D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment MUST support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification. The Unicode Standard is published by the The Unicode Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM).

    • So in your code, for each zipped file, first check whether bit 11 of the general purpose bit flag is set. If it is, then you can be certain that the name of that zipped file is encoded using UTF-8. Otherwise, the encoding is whatever was used when the zipped file was created. That is Cp437 by default on Windows, but if you are running on Windows and processing a zip file created on Linux, I don't think there is an easy way of determining the encoding(s) used.

    • Unfortunately ZipEntry does not provide a method to access the general purpose bit flag field of a zipped file, so you would need to process the zip file at the byte level to do that.
    • To add a further complication, "encoding" in this context relates to the encoding used for each zipped filename rather than for the zip file itself. One zipped file name could be encoded in UTF-8, another zipped file name could have been added using Cp437, etc.
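As a rough illustration of the byte-level check described above, the following sketch reads the general purpose bit flag from the first local file header of an archive and chooses the charset accordingly. It is a simplified example under several assumptions: ZipFlagPeek and its methods are hypothetical names, the archive is assumed to start with a local file header at offset 0 (so it is neither empty nor a self-extracting archive), only the first entry is inspected, and the fallback charset is a guess supplied by the caller. A complete implementation would walk every entry, typically through the central directory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipFlagPeek {

    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50; // "PK\3\4", little-endian
    static final int LOCAL_HEADER_SIZE = 30;              // fixed part, before the name
    static final int UTF8_FLAG = 1 << 11;                 // general purpose bit 11 (EFS)

    // Wraps the archive and verifies that a local file header starts at offset 0.
    private static ByteBuffer firstHeader(byte[] zip) {
        if (zip.length < LOCAL_HEADER_SIZE) {
            throw new IllegalArgumentException("Too short to contain a local file header");
        }
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("No local file header at offset 0");
        }
        return buf;
    }

    // True if bit 11 of the first entry's general purpose bit flag is set.
    static boolean firstEntryIsUtf8(byte[] zip) {
        int flags = firstHeader(zip).getShort(6) & 0xFFFF; // flag field is at offset 6
        return (flags & UTF8_FLAG) != 0;
    }

    // Decodes the first entry's name: UTF-8 if bit 11 is set, else the fallback.
    static String firstEntryName(byte[] zip, Charset fallback) {
        int nameLength = firstHeader(zip).getShort(26) & 0xFFFF; // name length at offset 26
        if (LOCAL_HEADER_SIZE + nameLength > zip.length) {
            throw new IllegalArgumentException("Truncated file name");
        }
        Charset charset = firstEntryIsUtf8(zip) ? StandardCharsets.UTF_8 : fallback;
        return new String(zip, LOCAL_HEADER_SIZE, nameLength, charset);
    }

    // Demo helper: builds a one-entry zip in memory with the given name charset.
    static byte[] createZip(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer, nameCharset)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write("abc".getBytes(StandardCharsets.US_ASCII));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("Cp437");
        String name = "Aufh\u00e4nge.txt"; // "Aufhänge.txt"
        for (Charset written : new Charset[] {cp437, StandardCharsets.UTF_8}) {
            byte[] zip = createZip(name, written);
            System.out.println(written + ": bit 11 set = " + firstEntryIsUtf8(zip)
                    + ", name = " + firstEntryName(zip, cp437));
        }
    }
}
```

When bit 11 is unset, the fallback remains a guess: Cp437 suits archives written with 7-Zip's default settings on Windows, but nothing in the header confirms it.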
