Read Unicode Files

Views: 162 · Published: 2017/11/3 19:14:20 · Tags: c++, file, unicode, text


Problem description

I have a problem reading and using the content from unicode files.

I am working on a Unicode release build, and I am trying to read the content from a Unicode file, but the data has strange characters and I can't seem to find a way to convert the data to ASCII.

I'm using fgets. I tried fgetws, WideCharToMultiByte, and a lot of functions which I found in other articles and posts, but nothing worked.

Solution

Because you mention WideCharToMultiByte I will assume you are dealing with Windows.

"read the content from an unicode file ... find a way to convert data to ASCII"

This might be a problem. If you convert Unicode to ASCII (or another legacy code page), you run the risk of corrupting or losing data: any character that the target code page cannot represent is replaced or dropped. Since you are "working on a unicode release build" you will want to read Unicode and stay Unicode.

So your final buffer will have to be wchar_t (or WCHAR, or CStringW, same thing).

So your file might be utf-16 or utf-8 (utf-32 is quite rare). For utf-16 the endianness might also matter. If there is a BOM, that will help a lot.
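
As a rough illustration of how a BOM identifies the encoding, here is a minimal, portable sketch. The names `Encoding`, `BomResult` and `detectBom` are made up for this example and are not part of any library. It only distinguishes the three encodings mentioned above, so a utf-32le BOM (which starts with the same two bytes as utf-16le) would be reported as utf-16le.

```cpp
#include <cstddef>
#include <string>

// Encodings distinguished by this sketch (hypothetical names).
enum class Encoding { Utf8, Utf16LE, Utf16BE, Unknown };

struct BomResult {
    Encoding encoding;    // encoding implied by the BOM, or Unknown
    std::size_t bomSize;  // number of BOM bytes to skip before decoding
};

// Inspect the first bytes of a file that was read in binary mode.
inline BomResult detectBom(const std::string& bytes) {
    const auto b = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return {Encoding::Utf8, 3};     // EF BB BF
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return {Encoding::Utf16LE, 2};  // FF FE
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return {Encoding::Utf16BE, 2};  // FE FF
    return {Encoding::Unknown, 0};      // no BOM: encoding must be guessed
}
```

If no BOM is found, the encoding has to come from somewhere else (a convention, a setting, or a heuristic); the sketch simply reports `Unknown` in that case.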

Quick steps:

open the file as binary with _wopen or _wfopen

read the first bytes to identify encoding using the BOM

if the encoding is utf-8, read in a byte array and convert to wchar_t with MultiByteToWideChar and CP_UTF8 (WideCharToMultiByte goes the other way, from wchar_t to bytes)

if the encoding is utf-16be (big endian), read in a wchar_t array and swap the two bytes of each element with _swab

if the encoding is utf-16le (little endian) read in a wchar_t array and you are done
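
The two utf-16 steps can be sketched without any Windows-specific calls. The helper below (`decodeUtf16` is a hypothetical name) assumes the file has already been read into a byte buffer and the BOM has been skipped. It builds each code unit from two bytes in the stated order, which has the same effect as reading into an array and calling `_swab` for big-endian input on a little-endian machine.

```cpp
#include <cstddef>
#include <string>

// Assemble UTF-16 code units from raw bytes (BOM already removed).
// bigEndian selects the byte order of the input.
// A trailing odd byte, if any, is ignored.
inline std::u16string decodeUtf16(const std::string& bytes, bool bigEndian) {
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto first = static_cast<unsigned char>(bytes[i]);
        const auto second = static_cast<unsigned char>(bytes[i + 1]);
        const unsigned hi = bigEndian ? first : second;  // high-order byte
        const unsigned lo = bigEndian ? second : first;  // low-order byte
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

On Windows, where wchar_t is 2 bytes, the resulting code units can be copied into a wchar_t buffer one by one. The sketch uses `char16_t` so that it behaves the same where wchar_t is 4 bytes.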

Also (if you use a newer Visual Studio), you might take advantage of an MS extension to _wfopen. It can take an encoding as part of the mode (something like _wfopen(L"newfile.txt", L"rt+, ccs=<encoding>"); with the encoding being UTF-8 or UTF-16LE). It can also detect the encoding based on the BOM.
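
A minimal sketch of that extension follows. `readFirstLine` is a hypothetical helper, and the buffer size and the choice of `ccs=UTF-8` are assumptions made for the example. Because the `ccs=` flag is specific to the Microsoft C runtime, the non-Windows branch just reports failure.

```cpp
#include <cstdio>
#include <cwchar>
#include <string>

// Read the first line of a text file into a wide string.
// Returns false if the file cannot be opened or is empty.
inline bool readFirstLine(const wchar_t* path, std::wstring& line) {
#ifdef _WIN32
    // ccs=UTF-8 asks the CRT to translate the file content to wchar_t.
    FILE* fp = _wfopen(path, L"rt, ccs=UTF-8");
    if (!fp) return false;
    wchar_t buf[512];  // arbitrary size for this sketch
    const bool ok = fgetws(buf, 512, fp) != nullptr;
    if (ok) line = buf;
    std::fclose(fp);
    return ok;
#else
    // The ccs= mode flag is not available outside the Microsoft CRT.
    (void)path;
    (void)line;
    return false;
#endif
}
```

Newer Visual Studio versions flag `_wfopen` as deprecated by default; `_wfopen_s` accepts the same mode string and avoids the warning.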

Warning: being cross-platform is problematic. wchar_t can be 2 or 4 bytes, and the conversion routines are not portable...
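
One way to sidestep the wchar_t size difference, if portability matters, is to decode into a fixed-width type instead. The sketch below is an assumption about how that could look, not a replacement for the platform routines. `decodeUtf8` is a hypothetical name, and the decoder is simplified: it replaces truncated or malformed sequences with U+FFFD but does not reject overlong forms or surrogate values.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-32 code points (simplified validation).
inline std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3;
        } else {
            out.push_back(0xFFFD);  // stray continuation or invalid lead byte
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                ok = false;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t{0xFFFD});
    }
    return out;
}
```

The result is the same on every platform because `char32_t` always holds a full code point, whereas the same text stored in wchar_t needs surrogate pairs when wchar_t is 2 bytes.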

Useful links:

BOM (http://unicode.org/faq/utf_bom.html)

_wfopen (http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx)
