How to reliably guess the encoding between MacRoman, CP1252, Latin1, UTF-8, and ASCII

Problem description

At work it seems like no week ever passes without some encoding-related conniption, calamity, or catastrophe. The problem usually derives from programmers who think they can reliably process a "text" file without specifying the encoding. But you can't.

So it's been decided to henceforth forbid files from ever having names that end in *.txt or *.text. The thinking is that those extensions mislead the casual programmer into a dull complacency regarding encodings, and this leads to improper handling. It would almost be better to have no extension at all, because at least then you know that you don’t know what you’ve got.

However, we aren’t going to go that far. Instead you will be expected to use a filename that ends in the encoding. So for text files, for example, these would be something like README.ascii, README.latin1, README.utf8, etc.

For files that demand a particular extension, if one can specify the encoding inside the file itself, such as in Perl or Python, then you shall do that. For files like Java source where no such facility exists internal to the file, you will put the encoding before the extension, such as SomeClass-utf8.java.

For output, UTF-8 is to be strongly preferred.

But for input, we need to figure out how to deal with the thousands of files in our codebase named *.txt. We want to rename all of them to fit into our new standard. But we can’t possibly eyeball them all. So we need a library or program that actually works.

These are variously in ASCII, ISO-8859-1, UTF-8, Microsoft CP1252, or Apple MacRoman. Although we know we can tell if something is ASCII, and we stand a good chance of knowing if something is probably UTF-8, we’re stumped about the 8-bit encodings. Because we’re running in a mixed Unix environment (Solaris, Linux, Darwin) with most desktops being Macs, we have quite a few annoying MacRoman files. And these especially are a problem.

For some time now I’ve been looking for a way to programmatically determine which of

  1. ASCII
  2. ISO-8859-1
  3. CP1252
  4. MacRoman
  5. UTF-8

a file is in, and I haven’t found a program or library that can reliably distinguish between those three different 8-bit encodings. We probably have over a thousand MacRoman files alone, so whatever charset detector we use has to be able to sniff those out. Nothing I’ve looked at can manage the trick. I had big hopes for the ICU charset detector library, but it cannot handle MacRoman. I’ve also looked at modules to do the same sort of thing in both Perl and Python, but again and again it’s always the same story: no support for detecting MacRoman.

What I am therefore looking for is an existing library or program that reliably determines which of those five encodings a file is in—and preferably more than that. In particular it has to distinguish between the three 8-bit encodings I’ve cited, especially MacRoman. The files are more than 99% English language text; there are a few in other languages, but not many.

If it’s library code, our language preference is for it to be in Perl, C, Java, or Python, and in that order. If it’s just a program, then we don’t really care what language it’s in so long as it comes in full source, runs on Unix, and is fully unencumbered.

Has anyone else had this problem of a zillion legacy text files randomly encoded? If so, how did you attempt to solve it, and how successful were you? This is the most important aspect of my question, but I’m also interested in whether you think encouraging programmers to name (or rename) their files with the actual encoding those files are in will help us avoid the problem in the future. Has anyone ever tried to enforce this on an institutional basis, and if so, was that successful or not, and why?

And yes, I fully understand why one cannot guarantee a definite answer given the nature of the problem. This is especially the case with small files, where you don’t have enough data to go on. Fortunately, our files are seldom small. Apart from the random README file, most are in the size range of 50k to 250k, and many are larger. Anything more than a few K in size is guaranteed to be in English.

The problem domain is biomedical text mining, so we sometimes deal with extensive and extremely large corpora, like all of PubMedCentral’s Open Access repository. A rather huge file is the BioThesaurus 6.0, at 5.7 gigabytes. This file is especially annoying because it is almost all UTF-8. However, some numbskull went and stuck a few lines in it that are in some 8-bit encoding—Microsoft CP1252, I believe. It takes quite a while before you trip on that one. :(

Solution

First, the easy cases:

ASCII

If your data contains no bytes above 0x7F, then it's ASCII. (Or a 7-bit ISO646 encoding, but those are very obsolete.)
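In Python, for instance, that check is a one-liner (a minimal sketch; the helper name is mine, not part of the answer):

    def is_ascii(data: bytes) -> bool:
        # ASCII uses only the 7-bit range 0x00-0x7F.
        return all(b <= 0x7F for b in data)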

UTF-8

If your data validates as UTF-8, then you can safely assume it is UTF-8. Due to UTF-8's strict validation rules, false positives are extremely rare.
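In Python the validation itself can serve as the test; this is just a sketch (run the ASCII check first, since pure ASCII also decodes cleanly as UTF-8):

    def is_utf8(data: bytes) -> bool:
        # Well-formed UTF-8 decodes without error; anything else raises.
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False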

ISO-8859-1 vs. windows-1252

The only difference between these two encodings is that ISO-8859-1 has the C1 control characters where windows-1252 has the printable characters €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. I've seen plenty of files that use curly quotes or dashes, but none that use C1 control characters. So don't even bother with them, or ISO-8859-1, just detect windows-1252 instead.
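A quick interactive illustration of the difference (not from the answer itself): byte 0x93 is the curly quote “ in windows-1252 but the invisible C1 control U+0093 in ISO-8859-1, which is why windows-1252 is the more useful guess for real-world text.

    >>> b"\x93quoted\x94".decode("cp1252")
    '“quoted”'
    >>> b"\x93quoted\x94".decode("latin-1")
    '\x93quoted\x94'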

That now leaves you with only one question.

How do you distinguish MacRoman from cp1252?

This is a lot trickier.

Undefined characters

The bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D are not used in windows-1252. If they occur, then assume the data is MacRoman.
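A sketch of that test in Python (the byte set is copied from the sentence above; the function name is my own):

    # Byte values with no assignment in windows-1252.
    CP1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

    def looks_like_macroman(data: bytes) -> bool:
        # Any hit on an undefined cp1252 byte points to MacRoman.
        return any(b in CP1252_UNDEFINED for b in data)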

Identical characters

The bytes 0xA2 (¢), 0xA3 (£), 0xA9 (©), 0xB1 (±), 0xB5 (µ) happen to be the same in both encodings. If these are the only non-ASCII bytes, then it doesn't matter whether you choose MacRoman or cp1252.
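If you want to recognize that don't-care case explicitly, a small sketch (again, names are mine):

    SHARED_BYTES = {0xA2, 0xA3, 0xA9, 0xB1, 0xB5}  # ¢ £ © ± µ in both encodings

    def either_guess_is_fine(data: bytes) -> bool:
        # True when every non-ASCII byte decodes identically in MacRoman and cp1252.
        return all(b <= 0x7F or b in SHARED_BYTES for b in data)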

Statistical approach

Count character (NOT byte!) frequencies in the data you know to be UTF-8. Determine the most frequent characters. Then use this data to determine whether the cp1252 or MacRoman characters are more common.

For example, in a search I just performed on 100 random English Wikipedia articles, the most common non-ASCII characters are ·•–é°®’èö—. Based on this fact,

  • The bytes 0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, or 0xF6 suggest windows-1252.
  • The bytes 0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, or 0xE1 suggest MacRoman.

Count up the cp1252-suggesting bytes and the MacRoman-suggesting bytes, and go with whichever is greatest.
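Putting the pieces together, here is a hedged sketch of the whole decision procedure in Python. The hint sets are copied from the lists above; the function name, the returned labels, and the choice to fall back to cp1252 on a tie are my own assumptions, not part of the answer.

    # Bytes whose cp1252 reading matches the common non-ASCII characters listed above.
    CP1252_HINTS = {0x92, 0x95, 0x96, 0x97, 0xAE, 0xB0, 0xB7, 0xE8, 0xE9, 0xF6}
    # Bytes whose MacRoman reading matches those same characters.
    MACROMAN_HINTS = {0x8E, 0x8F, 0x9A, 0xA1, 0xA5, 0xA8, 0xD0, 0xD1, 0xD5, 0xE1}
    # Bytes left undefined in windows-1252.
    CP1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

    def guess_encoding(data: bytes) -> str:
        if all(b <= 0x7F for b in data):
            return "ascii"
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
        # Not UTF-8, so it is one of the two 8-bit encodings.
        if any(b in CP1252_UNDEFINED for b in data):
            return "macroman"
        cp1252_score = sum(1 for b in data if b in CP1252_HINTS)
        macroman_score = sum(1 for b in data if b in MACROMAN_HINTS)
        return "macroman" if macroman_score > cp1252_score else "cp1252"

Scanning the bytes a few times like this is more than fast enough for files in the 50k-250k range described in the question, and the undefined-byte shortcut settles some MacRoman files before the frequency count is even needed.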
