How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

Question

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).

When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.
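
For reference, the BOM case can be sketched in a few lines of Python. This is only an illustration of the check described above, not the asker's Delphi code: the function name `encoding_from_bom` and the returned codec names are assumptions, and only the encodings named in the question are covered.

```python
import codecs

# Only the encodings named in the question; UTF-32 is deliberately left out
# (its little-endian BOM begins with the same two bytes as UTF-16 LE).
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def encoding_from_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would decode `data[bom_length:]` with the returned encoding, and fall through to some guessing step when the result is `(None, 0)`.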

I wanted to assume when there was no BOM that the file was ANSI. But I have found that the files I am dealing with often are missing their BOM. Therefore no BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or LE.

When the file has no BOM, what would be the best way to scan some of the file and most accurately guess the type of encoding? I'd like to be right close to 100% of the time if the file is ANSI and in the high 90's if it is a UTF format.
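
One common heuristic for the no-BOM case, sketched here in Python as a rough illustration rather than a tested detector: look for the alternating zero bytes that mostly-Latin UTF-16 text produces, then check whether the bytes are strictly valid UTF-8, and otherwise fall back to ANSI. The function name `guess_encoding`, the 4096-byte sample size, and the 40% / 10% thresholds are all assumptions chosen for the example. It would likely miss UTF-16 text in scripts such as CJK, where zero bytes are rare.

```python
import codecs


def guess_encoding(data: bytes, sample_size: int = 4096) -> str:
    """Guess 'utf-16-le', 'utf-16-be', 'utf-8' or 'ansi' for BOM-less data."""
    sample = data[:sample_size]
    if not sample:
        return "ansi"

    # Mostly-Latin UTF-16 has a zero byte in every other position:
    # odd offsets for little endian, even offsets for big endian.
    pairs = max(len(sample) // 2, 1)
    even_nuls = sample[0::2].count(0)
    odd_nuls = sample[1::2].count(0)
    if odd_nuls > 0.4 * pairs and even_nuls < 0.1 * pairs:
        return "utf-16-le"
    if even_nuls > 0.4 * pairs and odd_nuls < 0.1 * pairs:
        return "utf-16-be"

    # Pure ASCII decodes identically as ANSI or UTF-8; report it as ANSI.
    if sample.isascii():
        return "ansi"

    # Non-ASCII bytes that form strictly valid UTF-8 are rarely an accident.
    # final=False tolerates a multi-byte character cut off by the sample limit.
    truncated = len(data) > len(sample)
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=not truncated)
    except UnicodeDecodeError:
        return "ansi"
    return "utf-8"
```

The order of the checks matters: UTF-16 encoded ASCII text consists only of bytes below 0x80, so the zero-byte test has to run before the ASCII test.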

I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009 which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.

Answer:

ShreevatsaR's answer led me to search on Google for "universal encoding detector delphi", and I was surprised to find this post listed in the #1 position after being alive for only about 45 minutes! That is fast googlebotting!! It is also amazing that Stackoverflow gets into 1st place so quickly.

The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.

I found the mention of Delphi on that page, and it led me straight to the Free OpenSource ChsDet Charset Detector at SourceForge written in Delphi and based on Mozilla's i18n component.

Fantastic! Thank you all those who answered (all +1), thank you ShreevatsaR, and thank you again Stackoverflow, for helping me find my answer in less than an hour!

Recommended answer

Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection that is used by Firefox, and it is used by many different applications. Useful links: Mozilla's code, research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), short explanation, detailed explanation.
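
A minimal sketch of such a script, assuming the third-party `chardet` package is installed (`pip install chardet`). `chardet.detect` is the library's documented entry point; the helper names `detect_encoding` and `main`, the 64 KiB sample size, and the JSON output format are illustrative choices, not anything prescribed by the answer.

```python
import json
import sys


def detect_encoding(data: bytes):
    """Return (encoding, confidence) from chardet, or None if chardet is missing."""
    try:
        import chardet  # third-party package: pip install chardet
    except ImportError:
        return None
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


def main(path: str) -> None:
    # Reading only a sample keeps detection fast; the size is an arbitrary choice.
    with open(path, "rb") as f:
        sample = f.read(64 * 1024)
    guess = detect_encoding(sample)
    if guess is None:
        print(json.dumps({"error": "chardet is not installed"}))
    else:
        print(json.dumps({"encoding": guess[0], "confidence": guess[1]}))


# Run as: python detect.py somefile.txt  (the script name is hypothetical)
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

The calling program would start the script with a file path and parse the single JSON line from standard output. The confidence value lets the caller decide when to fall back to a default such as ANSI.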
