Read Unicode Files


Problem description

I have a problem reading and using the content of Unicode files.

I am working on a Unicode release build and trying to read the content of a Unicode file, but the data comes back with strange characters and I can't find a way to convert it to ASCII.

I'm using fgets. I also tried fgetws, WideCharToMultiByte, and many other functions I found in articles and posts, but nothing worked.

Solution

Because you mention WideCharToMultiByte, I will assume you are on Windows.

"read the content from an unicode file ... find a way to convert data to ASCII"

This might be a problem. If you convert Unicode to ASCII (or another legacy code page), you risk corrupting or losing data. Since you are "working on a unicode release build", you will want to read Unicode and stay Unicode.

So your final buffer will have to be wchar_t (or WCHAR, or a CStringW, all of which are wchar_t-based).

Your file is probably UTF-16 or UTF-8 (UTF-32 is quite rare). For UTF-16 the endianness also matters. If the file starts with a BOM (byte order mark), that helps a lot, because it identifies both the encoding and the byte order.
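The BOM check can be sketched in portable C. This is a minimal sketch rather than a definitive implementation: the names text_encoding and detect_bom are hypothetical, and it assumes the first bytes of the file have already been read into a buffer. It deliberately ignores UTF-32, so a UTF-32LE file (whose BOM begins with FF FE) would be reported as UTF-16LE.

```c
#include <assert.h>
#include <stddef.h>

/* Encodings this sketch can recognize from a byte order mark. */
typedef enum { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE } text_encoding;

/* Inspects the first bytes of a file and reports the encoding implied by
   its BOM. If bom_len is non-NULL, it receives the BOM length in bytes so
   the caller can skip past it. UTF-32 BOMs are not handled (assumption). */
text_encoding detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    assert(buf != NULL || len == 0);

    text_encoding enc = ENC_UNKNOWN;
    size_t skip = 0;

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        enc = ENC_UTF8;      /* EF BB BF */
        skip = 3;
    } else if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        enc = ENC_UTF16LE;   /* FF FE */
        skip = 2;
    } else if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        enc = ENC_UTF16BE;   /* FE FF */
        skip = 2;
    }

    if (bom_len != NULL) {
        *bom_len = skip;
    }
    return enc;
}
```

A file without a BOM yields ENC_UNKNOWN; in that case the encoding has to come from somewhere else, such as a convention, metadata, or a heuristic.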

Quick steps:

• open the file as binary with _wopen or _wfopen
• read the first bytes and use the BOM to identify the encoding
• if the encoding is UTF-8, read into a byte array and convert it to wchar_t with MultiByteToWideChar and CP_UTF8 (WideCharToMultiByte goes the other way, from wchar_t to bytes)
• if the encoding is UTF-16BE (big endian), read into a wchar_t array and swap the two bytes of each element with _swab
• if the encoding is UTF-16LE (little endian), read into a wchar_t array and you are done
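For the UTF-16BE step, the byte swap is simple enough to write by hand. The sketch below is a portable stand-in for _swab, not the CRT function itself. The name swap_byte_pairs is hypothetical, and the code assumes 2-byte code units, which is what wchar_t is on Windows.

```c
#include <assert.h>
#include <stddef.h>

/* Swaps each adjacent pair of bytes in place, turning UTF-16BE data into
   UTF-16LE (or the reverse). If len is odd, the final byte is left
   untouched. Returns buf so the call can be chained. */
unsigned char *swap_byte_pairs(unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i + 1 < len; i += 2) {
        unsigned char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
    return buf;
}
```

On a little-endian machine such as a Windows PC, running this over a UTF-16BE buffer (after skipping the BOM) leaves data that can be treated as an array of 2-byte code units in host order.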

Also, if you use a reasonably recent Visual Studio, you can take advantage of a Microsoft extension to _wfopen: the mode string can carry an encoding, for example _wfopen(L"newfile.txt", L"r, ccs=<encoding>"), where <encoding> is UTF-8 or UTF-16LE. When a file opened this way for reading starts with a BOM, the runtime uses the BOM to determine the encoding.

Warning: making this cross-platform is problematic. wchar_t can be 2 bytes (as on Windows) or 4 bytes (as on most Unix-like systems), and the Win32 conversion routines are not portable.
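One way around the wchar_t size problem is to avoid wchar_t when decoding and work with fixed-width integers instead. The following is a minimal sketch under that assumption: utf16_first_code_point, read_unit, and INVALID_CODE_POINT are hypothetical names, and the function decodes only the first code point of a raw UTF-16 byte buffer, assembling each code unit with shifts so the result does not depend on the host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_CODE_POINT 0xFFFFFFFFu

/* Builds one 16-bit code unit from two bytes in the given byte order. */
static uint16_t read_unit(const unsigned char *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/* Decodes the first code point of a UTF-16 byte sequence (BOM already
   skipped). Returns INVALID_CODE_POINT if the input is truncated or
   contains an unpaired surrogate. */
uint32_t utf16_first_code_point(const unsigned char *bytes, size_t len,
                                int big_endian)
{
    assert(bytes != NULL || len == 0);

    if (len < 2) {
        return INVALID_CODE_POINT;
    }
    uint32_t hi = read_unit(bytes, big_endian);
    if (hi < 0xD800 || hi > 0xDFFF) {
        return hi;                      /* single unit, no surrogates */
    }
    if (hi > 0xDBFF || len < 4) {
        return INVALID_CODE_POINT;      /* lone low surrogate or truncated */
    }
    uint32_t lo = read_unit(bytes + 2, big_endian);
    if (lo < 0xDC00 || lo > 0xDFFF) {
        return INVALID_CODE_POINT;      /* high surrogate without a low one */
    }
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

Because the result is a uint32_t code point, the same code behaves identically whether wchar_t is 2 or 4 bytes on the target platform.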
