How to identify UTF-8 encoded strings
Problem description
What's the best way to identify whether a string is, or might be, UTF-8 encoded? The Win32 API IsTextUnicode isn't of much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And, yes, I know that only characters above the ASCII range are encoded with more than 1 byte.
chardet is the character set detection code developed by Mozilla and used in Firefox. Its source code is available.
jchardet is a Java port of the source of Mozilla's automatic charset detection algorithm.
NCharDet is a .NET (C#) port of a Java port of the C++ code used in the Mozilla and Firefox browsers.
A CodeProject C# sample uses Microsoft's MLang for character encoding detection.
UTRAC is a command-line tool and library written in C++ to detect string encoding.
cpdetector is a Delphi library used for encoding detection.
Another useful post points to many more libraries that help determine character encoding: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
You could also take a look at the related question "How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?"; it has some useful content.
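If a full detection library is more than you need, a simple heuristic may be enough: check whether the bytes are well-formed UTF-8, and whether they contain any byte outside the ASCII range. The Python sketch below illustrates that idea only; it is not how the libraries listed above work. The function names (`looks_like_utf8`, `has_non_ascii`, `classify`) are hypothetical, and the result is a guess, since text in another encoding can occasionally happen to be valid UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8.

    This is a necessary condition, not proof: some byte strings in other
    encodings are also valid UTF-8 by coincidence.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def has_non_ascii(data: bytes) -> bool:
    """Return True if any byte lies outside the 7-bit ASCII range."""
    return any(b >= 0x80 for b in data)


def classify(data: bytes) -> str:
    """Give a rough verdict for a byte string without a BOM."""
    if not looks_like_utf8(data):
        return "not UTF-8"
    if has_non_ascii(data):
        # Valid multi-byte sequences are unlikely to appear by accident.
        return "probably UTF-8"
    # Pure ASCII is valid UTF-8, but also valid in many other encodings.
    return "ASCII (also valid UTF-8)"


if __name__ == "__main__":
    for sample in ("héllo".encode("utf-8"), "héllo".encode("latin-1"), b"hello"):
        print(sample, "->", classify(sample))
```

The longer the input and the more multi-byte sequences it contains, the more trustworthy a "probably UTF-8" verdict becomes; for short or pure-ASCII input the check cannot distinguish UTF-8 from other ASCII-compatible encodings.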