How to identify UTF-8 encoded strings

Question

What's the best way to identify whether a string is (or might be) UTF-8 encoded? The Win32 API IsTextUnicode isn't of much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And yes, I know that only characters above the ASCII range are encoded with more than 1 byte.

Solution

chardet is the character set detection code developed by Mozilla and used in Firefox. Source code

jchardet is a Java port of the source of Mozilla's automatic charset detection algorithm.

NCharDet is a .NET (C#) port of a Java port of the C++ code used in the Mozilla and Firefox browsers.

A Code Project C# sample uses Microsoft's MLang for character encoding detection.

UTRAC is a command-line tool and library written in C++ to detect string encoding.

cpdetector is a Delphi library used for encoding detection.

Another useful post points to a lot of libraries that help you determine character encoding: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

You could also take a look at the related question "How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?", which has some useful content.
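The libraries listed above guess among many encodings statistically. For the narrower yes/no question of whether a byte string could be UTF-8, a strict decode is often enough, because UTF-8's lead and continuation byte rules make it unlikely that non-trivial text in another 8-bit encoding validates by accident. The following is only a minimal sketch in Python, not something taken from the original answer; the function names `looks_like_utf8` and `has_multibyte_utf8` are made up for illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8.

    Pure ASCII input also passes, since ASCII is a subset of UTF-8.
    """
    try:
        # Strict decoding raises on any malformed byte sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def has_multibyte_utf8(data: bytes) -> bool:
    """Return True if `data` is valid UTF-8 and contains non-ASCII bytes.

    This is the stronger signal: the input actually exercises UTF-8's
    multi-byte structure instead of merely being plain ASCII.
    """
    return looks_like_utf8(data) and any(b >= 0x80 for b in data)
```

A passing check means the bytes are consistent with UTF-8, not that they were definitely produced as UTF-8; short inputs in particular can validate by coincidence, which is where the statistical detectors above are a better fit.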
