Read Unicode Files
Problem description
I have a problem reading and using the content of Unicode files.
I am working on a Unicode release build and trying to read the content of a Unicode file, but the data has strange characters and I can't find a way to convert the data to ASCII.
I'm using fgets. I tried fgetws, WideCharToMultiByte, and a lot of functions that I found in other articles and posts, but nothing worked.
Solution
Because you mention WideCharToMultiByte, I will assume you are dealing with Windows.
"read the content from a Unicode file ... find a way to convert data to ASCII"
This might be a problem. If you convert Unicode to ASCII (or another legacy code page), you risk corrupting or losing data, because characters outside the target code page cannot be represented. Since you are "working on a unicode release build", you will want to read Unicode and stay Unicode.
So your final buffer will have to be wchar_t (or WCHAR, or CStringW, which amount to the same thing).
Your file might be UTF-16 or UTF-8 (UTF-32 is quite rare). For UTF-16 the endianness also matters. If there is a BOM, that will help a lot.
Quick steps:
- open the file as binary with _wopen or _wfopen
- read the first bytes to identify the encoding using the BOM
- if the encoding is UTF-8, read into a byte array and convert to wchar_t with MultiByteToWideChar and CP_UTF8 (WideCharToMultiByte converts in the opposite direction, from wchar_t to bytes)
- if the encoding is UTF-16BE (big endian), read into a wchar_t array and swap the bytes with _swab
- if the encoding is UTF-16LE (little endian), read into a wchar_t array and you are done
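The BOM check and the byte swap from the steps above can be sketched in portable C. This is only an illustration, not code from the answer: the names text_encoding, detect_bom and swap_utf16_units are made up, the sketch uses uint16_t rather than wchar_t so that it does not depend on wchar_t being 2 bytes, and it deliberately ignores UTF-32 BOMs. The UTF-8 conversion itself is left to the Windows API.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodings this sketch distinguishes (illustrative names). */
typedef enum {
    ENC_UNKNOWN,   /* no BOM found; the caller has to assume an encoding */
    ENC_UTF8,
    ENC_UTF16LE,
    ENC_UTF16BE
} text_encoding;

/* Inspect the first bytes of a file and report the encoding indicated by
   its BOM. *bom_len receives the number of BOM bytes to skip (0 if none).
   UTF-32 BOMs are not handled in this sketch. */
static text_encoding detect_bom(const unsigned char *buf, size_t len,
                                size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return ENC_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return ENC_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return ENC_UTF16BE;
    }
    return ENC_UNKNOWN;
}

/* Swap the two bytes of every 16-bit code unit in place, which turns
   UTF-16BE data into UTF-16LE (or the reverse). */
static void swap_utf16_units(uint16_t *units, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
    }
}
```

A caller would read the first few bytes of the file opened in binary mode, pass them to detect_bom, skip bom_len bytes, and call swap_utf16_units only when the file's byte order differs from the machine's (for example ENC_UTF16BE on a little-endian Windows machine).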
Also (if you use a newer Visual Studio), you can take advantage of a Microsoft extension to _wfopen. It can take an encoding as part of the mode, something like _wfopen(L"newfile.txt", L"rt+, ccs=<encoding>"); with the encoding being UTF-8 or UTF-16LE. It can also detect the encoding based on the BOM.
Warning: being cross-platform is problematic. wchar_t can be 2 or 4 bytes, and the conversion routines are not portable...
Useful links:
- BOM (http://unicode.org/faq/utf_bom.html)
- wfopen (http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx)