Encoding of files inside a zip (C# / ionic-zip)


Problem description

We have a problem with the encoding of files inside a zip file. We are using Ionic Zip to compress and decompress archives. We are located in Denmark, so we often have files containing æ, ø or å in the file names.

When a user uses Windows' built-in tool to compress files, I found that it uses the IBM437 encoding, which gave some funky results when file names contained 'ø' / 'Ø'. I fixed this with the following code:

public static string IBM437Encode(this string text)
{
    return text.Replace('ø', '¢').Replace('Ø', '¥');
}
public static string IBM437Decode(this string text)
{
    return text.Replace('¢', 'ø').Replace('¥', 'Ø');
}

This has been running for some time now, and all has been fine.

But, because there's always a but, we didn't try it with a file compressed with the default tool in Mac OS X. So now we have a new problem: when using æ, ø and å, the encoding is UTF-8! I can get it to work if I know where the zip was compressed, but is there any easy way to detect or normalize the encoding inside a zip?
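
For the "if I know where the zip was compressed" part, here is a minimal sketch of how that could look with DotNetZip (Ionic Zip), assuming its ReadOptions.Encoding property behaves as remembered here; the helper name is made up for illustration, so verify against the version you actually use:

using System.Text;
using Ionic.Zip;

static class ZipReadSketch
{
    // Hypothetical helper (not part of Ionic.Zip): open an archive and tell
    // DotNetZip which encoding to use for entry names.
    public static ZipFile OpenWithNameEncoding(string path, Encoding nameEncoding)
    {
        var options = new ReadOptions { Encoding = nameEncoding };
        return ZipFile.Read(path, options);
    }
}

// Illustrative usage: zips from the Windows tool vs. the macOS tool.
// (On .NET Core/5+, code page 437 needs the System.Text.Encoding.CodePages package.)
// var winZip = ZipReadSketch.OpenWithNameEncoding("from-windows.zip", Encoding.GetEncoding(437));
// var macZip = ZipReadSketch.OpenWithNameEncoding("from-macos.zip", Encoding.UTF8);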

Solution

Detecting encoding is always a tricky business, but UTF8 has strict bitwise rules about what values are expected in a valid sequence, and you can initialize a UTF8Encoding object in a way that will fail by throwing an exception when these sequences are incorrect:

public static Boolean MatchesUtf8Encoding(Byte[] bytes)
{
    UTF8Encoding enc = new UTF8Encoding(false, true);
    try { enc.GetString(bytes); }
    catch(ArgumentException) { return false; }
    return true;
}

If you'd run that over all filenames in a zip you can determine if it fails anywhere, in which case you can conclude the names are not saved as UTF-8.
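
As an illustration of running that check over a whole archive, here is a minimal sketch. It assumes the entry names were decoded with code page 437 (the zip-spec default, and what DotNetZip uses for entries without the UTF-8 flag), so the original bytes can be recovered by re-encoding with the same code page; NormalizeEntryName is just an illustrative name, not a library method:

using System.Text;

static class ZipNameSketch
{
    // Illustrative helper: take an entry name that was decoded as IBM437,
    // recover the raw bytes, and re-decode them as UTF-8 if they form a valid
    // UTF-8 sequence; otherwise keep the IBM437 interpretation. Code page 437
    // decodes every byte to a distinct character, so re-encoding the decoded
    // string yields the original raw bytes.
    public static string NormalizeEntryName(string nameDecodedAs437)
    {
        // On .NET Core/5+, code page 437 needs the System.Text.Encoding.CodePages
        // package plus Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
        Encoding ibm437 = Encoding.GetEncoding(437);
        byte[] raw = ibm437.GetBytes(nameDecodedAs437);

        return MatchesUtf8Encoding(raw)
            ? Encoding.UTF8.GetString(raw)  // e.g. names written by the macOS tool
            : nameDecodedAs437;             // e.g. names written by the Windows tool
    }

    // Same check as above, repeated so the sketch is self-contained.
    public static bool MatchesUtf8Encoding(byte[] bytes)
    {
        UTF8Encoding enc = new UTF8Encoding(false, true);
        try { enc.GetString(bytes); }
        catch (ArgumentException) { return false; }
        return true;
    }
}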


Do note that besides UTF-8 there's also the annoying difference between the computer's default encoding (Encoding.Default, usually Windows-1252 in US and Western EU countries, but annoyingly different depending on regions and languages) and the DOS-437 encoding you already encountered.

Making the distinction between those is very, very hard, and would probably need to be done by actually checking for each encoding which ranges beyond byte 0x80 produce normal accented characters, and which are special characters you generally won't expect to encounter in a file name. For example, a lot of the DOS-437 characters are frames that were used to draw semi-graphical user interfaces in DOS.

For reference, these are the special characters (so the byte range 0x80-0xFF) in DOS-437:

80    ÇüéâäàåçêëèïîìÄÅ
90    ÉæÆôöòûùÿÖÜ¢£¥₧ƒ
A0    áíóúñѪº¿⌐¬½¼¡«»
B0    ░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
C0    └┴┬├─┼╞╟╚╔╩╦╠═╬╧
D0    ╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
E0    αßΓπΣσµτΦΘΩδ∞φε∩
F0    ≡±≥≤⌠⌡÷≈°∙·√ⁿ²■ 

And in Windows-1252:

80    €�‚ƒ„…†‡ˆ‰Š‹Œ�Ž�
90    �‘’“”•–—˜™š›œ�žŸ
A0     ¡¢£¤¥¦§¨©ª«¬�®¯
B0    °±²³´µ¶·¸¹º»¼½¾¿
C0    ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏ
D0    ÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
E0    àáâãäåæçèéêëìíîï
F0    ðñòóôõö÷øùúûüýþÿ

Some of these aren't even printable, so that makes it a bit easier.

As you see, generally, DOS-437 has most of its accented characters in the 0x80-0xA5 region (with the Beta at 0xE1 often used in Germany as eszett), whereas Win-1252 has practically all of them in the region 0xC0-0xFF. If you determine these regions you can make a scan mechanism that evaluates which encoding it seems to lean towards, simply by counting how many fall inside and outside the expected ranges for each.
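
A rough sketch of such a counting pass, using the 0x80-0xA5 and 0xC0-0xFF regions mentioned above as approximate ranges (a heuristic only; the exact boundaries and the tie-breaking rule are assumptions made for illustration):

static class CodePageGuessSketch
{
    // Heuristic sketch: count how many high bytes of a raw file name fall in
    // the region where each code page keeps its accented letters, and lean
    // towards whichever region collects more hits.
    public static string GuessNameCodePage(byte[] rawName)
    {
        int dosHits = 0, winHits = 0;
        foreach (byte b in rawName)
        {
            if (b < 0x80) continue;      // ASCII bytes carry no signal either way
            if (b <= 0xA5) dosHits++;    // DOS-437 keeps its accented letters here
            if (b >= 0xC0) winHits++;    // Windows-1252 keeps its accented letters here
        }
        if (dosHits == 0 && winHits == 0) return "ASCII only";
        return dosHits >= winHits ? "IBM437" : "Windows-1252";
    }
}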


Note that Char in c# represents a unicode character, no matter what it was loaded from as bytes, and unicode characters have certain classifications you can look up programmatically that distinguish them between normal letters (possibly with diacritics) and various classes of special characters (simple example: I know one of these classes is "whitespace characters"). It may be worth looking into that system to automate the process of determining what "normal language characters" are.
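
A small sketch of that idea: decode the raw name bytes with a candidate encoding and reject the result if it contains characters from Unicode categories you would not expect in a file name (box-drawing frames, control characters, and so on). The whitelist of punctuation and categories below is an assumption chosen for illustration, not a definitive rule:

using System.Globalization;
using System.Text;

static class NamePlausibilitySketch
{
    // Sketch: does decoding the raw bytes with this encoding produce something
    // that looks like a normal file name?
    public static bool LooksLikeNormalFileName(byte[] rawName, Encoding candidate)
    {
        string decoded = candidate.GetString(rawName);
        foreach (char c in decoded)
        {
            if (char.IsLetterOrDigit(c) || char.IsWhiteSpace(c)) continue;
            if (".-_()[]".IndexOf(c) >= 0) continue;   // common file-name punctuation

            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
            // DOS-437's box-drawing frames land in OtherSymbol, its math symbols
            // in MathSymbol, and stray control bytes in Control.
            if (cat == UnicodeCategory.Control ||
                cat == UnicodeCategory.OtherSymbol ||
                cat == UnicodeCategory.MathSymbol)
                return false;
        }
        return true;
    }
}

Running this once with Encoding.GetEncoding(437) and once with Encoding.GetEncoding(1252) on the same raw bytes, and keeping whichever interpretation passes, is one way to combine it with the range-counting idea above.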
