Detect actual charset encoding in UTF

Problem Description

I need a good tool to detect the encoding of strings, using some kind of mapping or heuristic method.

For example, the string: áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì

Expected: сохранив много приложений Java, можно занять всю доступную память

The encoding is "ISO8859-5". When I try to detect it with the libraries below, the result is "UTF-8". The string has obviously been saved in UTF, but is there any heuristic way, using symbol mapping, to analyse the characters and match them with the correct encoding?

Usual encoding detection libraries tried:

- enca (aptitude install enca)
- chardet (aptitude install chardet)
- uchardet (aptitude install uchardet)
- http://tika.apache.org/
- http://npmjs.com/package/detect-encoding
- libencode-detect-perl
- http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
- http://jchardet.sourceforge.net/
- http://grepcode.com/snapshot/repo1.maven.org/maven2/com.googlecode.juniversalchardet/juniversalchardet/1.0.3/
- http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/
- http://userguide.icu-project.org/
- http://site.icu-project.org

Solution

You need to unwrap the UTF-8 encoding and then pass it to a character-encoding detection library.

If random 8-bit data is encoded into UTF-8 (assuming an identity mapping, i.e. a C4 byte is assumed to represent U+00C4, as is the case with ISO-8859-1 and its superset Windows 1252), you end up with something like

Source:  8F    0A 20 FE    65
Result:  C2 8F 0A 20 C3 BE 65

(because the UTF-8 encoding of U+008F is C2 8F, and U+00FE is C3 BE). You need to revert this encoding in order to obtain the source string, so that you can then identify its character encoding.
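
Before the full example below, here is a minimal sketch of just the unwrapping step (assuming Python 3; latin-1 is the 8-bit identity codec discussed above):

# Decode the double-encoded bytes as UTF-8, then re-encode with an
# 8-bit identity codec (latin-1) to recover the original byte values.
double_encoded = b'\xc2\x8f\x0a\x20\xc3\xbe\x65'
recovered = double_encoded.decode('utf-8').encode('latin-1')
print(recovered.hex())  # prints '8f0a20fe65', the source bytes from above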

In Python, something like

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import chardet

# Re-encode the mojibake with the codec we suspect was wrongly applied
# (cp1252), recovering the original bytes, then let chardet guess.
mystery = u'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì'
print(chardet.detect(mystery.encode('cp1252')))

Result:

{'confidence': 0.99, 'encoding': 'ISO-8859-5'}

On the Unix command line,

vnix$ echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
> iconv -t cp1252 | chardet
<stdin>: ISO-8859-5 (confidence: 0.99)

or iconv -t cp1252 file | chardet to decode a file and pass it to chardet.

(For this to work successfully at the command line, you need to have your environment properly set up for transparent Unicode handling. I am assuming that your shell, your terminal, and your locale are adequately configured. Try a recent Ubuntu Live CD or something if your regular environment is stuck in the 20th century.)

In the general case, you cannot know that the incorrectly applied encoding is CP 1252 but in practice, I guess it's going to be correct (as in, yield correct results for this scenario) most of the time. In the worst case, you would have to loop over all available legacy 8-bit encodings and try them all, then look at the one(s) with the highest confidence rating from chardet. Then, the example above will be more complex, too -- the mapping from legacy 8-bit data to UTF-8 will no longer be a simple identity mapping, but rather involve a translation table as well (for example, a byte F5 might correspond arbitrarily to U+0092 or whatever).
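
Here is a minimal Python sketch of that brute-force loop; the candidate list is an illustrative assumption, not exhaustive, and note how much simpler the error handling is than in the shell version below:

import chardet

mystery = u'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì'

# Illustrative, non-exhaustive list of single-byte codecs to try.
candidates = ['cp1252', 'latin-1', 'cp1250', 'cp1251', 'cp1254', 'mac-roman', 'cp850']

results = []
for codec in candidates:
    try:
        raw = mystery.encode(codec)  # undo the suspected mis-decoding
    except UnicodeEncodeError:
        continue  # the string is not representable in this codec; skip it
    guess = chardet.detect(raw)
    results.append((guess['confidence'], codec, guess['encoding']))

# Report the guesses, highest chardet confidence first.
for confidence, codec, encoding in sorted(results, reverse=True):
    print('%-10s -> %-12s (%.2f)' % (codec, encoding, confidence))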

(Incidentally, iconv -l spits out a long list of aliases, so you will get a lot of fundamentally identical results if you use that as your input. But here is a quick ad-hoc attempt at fixing your slightly weird Perl script.

#!/bin/sh
# Re-encode the mojibake with every codec iconv knows (the first grep
# filters out obvious multi-byte families), run each result through
# chardet, and label each verdict with the codec name; finally, drop
# uninteresting verdicts and sort by chardet's confidence (field 4).
iconv -l |
grep -F -v -e UTF -e EUC -e 2022 -e ISO646 -e GB2312 -e 5601 |
while read enc; do
    echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
    iconv -f utf-8 -t "${enc%//}" 2>/dev/null |
    chardet | sed "s%^[^:]*%${enc%//}%"
done |
grep -Fwive ascii -e utf -e euc -e 2022 -e None |
sort -k4rn

The output still contains a lot of chaff, but once you remove that, the verdict is straightforward.

It makes no sense to try any multi-byte encodings such as UTF-16, ISO-2022, GB2312, EUC-KR etc. in this scenario. If you convert a string into one of these successfully, then the result will most definitely be in that encoding. This is outside the scope of the problem outlined above: a string converted from an 8-bit encoding into UTF-8 using the wrong translation table.
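
To illustrate that point, strict decoding doubles as a validity check; a small sketch reusing the unwrapped source bytes from the earlier example:

raw = b'\x8f\x0a\x20\xfe\x65'  # the unwrapped source bytes from above
try:
    raw.decode('utf-8')        # strict by default; raises on malformed input
    print('plausibly UTF-8')
except UnicodeDecodeError:
    print('not valid UTF-8')   # taken here: 0x8F cannot start a UTF-8 sequence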

The ones which returned ascii definitely did something wrong; most of them will have received an empty input, because iconv failed with an error. In a Python script, error handling would be more straightforward.)
