Detect text file encoding

Views: 102 Published: 2020/10/1 0:37:05 Tags: c++ qt character-encoding

Problem description

In my program I load plain text files supplied by the user:

QFile file(fileName);
file.open(QIODevice::ReadOnly);
QTextStream stream(&file);
const QString &text = stream.readAll();

This works fine when the files are UTF-8 encoded, but some users try to import Windows-1252 encoded files, and if they have words with special characters (for example "è" in "boutonnière"), those will show incorrectly.

Is there a way to detect the encoding, or at least distinguish between UTF-8 (possibly without BOM), and Windows-1252, without asking the user to tell me the encoding?

Solution

It turns out that reliably auto-detecting the encoding is impossible in the general case.

However, there is a workaround that at least falls back to the system locale when the text is not recognized as UTF-8, UTF-16 or UTF-32. It uses QTextCodec::codecForUtfText(), which tries to identify the byte array as UTF-8, UTF-16 or UTF-32 and returns the supplied default codec if it cannot.

Code to do it:

// byteArray is a QByteArray holding the raw file contents, e.g. from QFile::readAll()
QTextCodec *codec = QTextCodec::codecForUtfText(byteArray, QTextCodec::codecForName("System"));
const QString &text = codec->toUnicode(byteArray);

Update

The above code will not detect UTF-8 without BOM, however, as codecForUtfText() relies on the BOM markers. To detect UTF-8 without BOM, see https://stackoverflow.com/a/18228382/492336.
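
For the specific case in the question (UTF-8 without BOM versus Windows-1252), one common heuristic is to check whether the bytes form well-formed UTF-8 and to treat them as Windows-1252 otherwise. The following is a minimal sketch of that idea in plain C++, not the code from the linked answer and not a Qt API. The names isValidUtf8, GuessedEncoding and guessEncoding are hypothetical and introduced only for illustration.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8: no stray or missing
// continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.
bool isValidUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80) {              // plain ASCII byte
            ++i;
            continue;
        }
        std::size_t extra = 0;       // continuation bytes expected after the lead byte
        unsigned long cp = 0;        // code point being decoded
        unsigned long minCp = 0;     // smallest code point allowed for this length
        if ((c & 0xE0) == 0xC0)      { extra = 1; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; minCp = 0x10000; }
        else return false;           // stray continuation byte or invalid lead byte

        if (n - i <= extra)
            return false;            // sequence is cut off at the end of the input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;        // expected a continuation byte (10xxxxxx)
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;            // overlong form, out of range, or surrogate
        i += extra + 1;
    }
    return true;
}

// Hypothetical result type for this sketch.
enum class GuessedEncoding { Utf8, Windows1252 };

// Assumption: anything that is not well-formed UTF-8 is Windows-1252.
GuessedEncoding guessEncoding(const std::string &bytes)
{
    return isValidUtf8(bytes) ? GuessedEncoding::Utf8
                              : GuessedEncoding::Windows1252;
}
```

In a Qt program the input could come from the file's QByteArray (for example via QByteArray::toStdString(), assuming Qt 5.4 or later), and the result could then select QTextCodec::codecForName("UTF-8") or QTextCodec::codecForName("Windows-1252"). This remains a heuristic: a pure-ASCII file is identical in both encodings, and a Windows-1252 text can in rare cases happen to be valid UTF-8.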
