How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?


Question


My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).

When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.

I wanted to assume when there was no BOM that the file was ANSI. But I have found that the files I am dealing with often are missing their BOM. Therefore no BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or LE.

When the file has no BOM, what would be the best way to scan part of the file and most accurately guess its encoding? I'd like to be right close to 100% of the time if the file is ANSI, and in the high 90s (percent) if it is a UTF format.

I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009 which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.


Answer:

ShreevatsaR's answer led me to search Google for "universal encoding detector delphi", which surprised me by listing this post in the #1 position after it had been alive for only about 45 minutes! That is fast googlebotting!! It is also amazing that Stack Overflow gets into 1st place so quickly.

The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.

I found the mention of Delphi on that page, and it led me straight to the Free OpenSource ChsDet Charset Detector at SourceForge written in Delphi and based on Mozilla's i18n component.

Fantastic! Thank you to all those who answered (all +1), thank you ShreevatsaR, and thank you again Stack Overflow, for helping me find my answer in less than an hour!

Solution

Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection used by Firefox, and it is used by many different applications. Useful links: Mozilla's code, the research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), a short explanation (http://chardet.feedparser.org/docs/faq.html#faq.impossible), and a detailed explanation.
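For a sense of what a simple guesser looks like for the four cases in the question, here is a minimal standard-library sketch. It is not Chardet's or Mozilla's algorithm, only a rough heuristic: the function name `guess_encoding`, the `cp1252` fallback standing in for "ANSI", and the 0.4 / 0.1 thresholds are all assumptions chosen for illustration.

```python
import codecs

# Known BOMs, checked first when present.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_encoding(data: bytes, ansi_fallback: str = "cp1252") -> str:
    """Guess among UTF-8, UTF-16 LE/BE and a single-byte 'ANSI' code page.

    Hypothetical helper: the thresholds and the fallback are assumptions.
    The returned name does not strip a BOM; the caller must skip it.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    if not data:
        return ansi_fallback

    # UTF-16 text that is mostly Latin has a zero byte in every other
    # position: the odd offsets for little endian, the even ones for big.
    even_nuls = data[0::2].count(0)
    odd_nuls = data[1::2].count(0)
    pairs = max(len(data) // 2, 1)
    if odd_nuls / pairs > 0.4 and even_nuls / pairs < 0.1:
        return "utf-16-le"
    if even_nuls / pairs > 0.4 and odd_nuls / pairs < 0.1:
        return "utf-16-be"

    # Single-byte text with accented characters rarely forms valid
    # multi-byte UTF-8 sequences, so a clean decode is strong evidence.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return ansi_fallback
    return "utf-8"


if __name__ == "__main__":
    for sample in ("héllo".encode("utf-8"),
                   "héllo".encode("cp1252"),
                   "hello".encode("utf-16-le"),
                   "hello".encode("utf-16-be")):
        print(sample, "->", guess_encoding(sample))
```

Pure ASCII input is reported as `utf-8`, which is harmless because ASCII bytes decode identically under UTF-8 and the common ANSI code pages. The zero-byte test will miss UTF-16 text that is mostly non-Latin, which is where a statistical detector earns its keep; if the Chardet package is installed, `chardet.detect(data)` returns a dict with `encoding` and `confidence` keys.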
