How to read a UCS-2 file?


Problem description


I'm writing a program to extract information from a *.rc file encoded in UCS-2 Little Endian.

int _tmain(int argc, _TCHAR* argv[]) {
  wstring csvLine(wstring sLine);
  wifstream fin("en.rc");
  wofstream fout("table.csv");
  wofstream fout_rm("temp.txt");
  wstring sLine;
  fout << "en\n";
  while(getline(fin,sLine)) {
    if (sLine.find(L"IDS") == -1)
      fout_rm << sLine << endl;
    else
      fout << csvLine(sLine);
  }
  fout << flush;
  system("pause");
  return 0;
}

The first line in "en.rc" is #include <windows.h>, but sLine shows up as below:

[0]     255 L'ÿ'
[1]     254 L'þ'
[2]     35  L'#'
[3]     0
[4]     105 L'i'
[5]     0
[6]     110 L'n'
[7]     0
[8]     99  L'c'
.       .
.       .
.       .

This program works correctly for UTF-8 files. How can I make it work for UCS-2?

Solution

Wide streams use a wide stream buffer to access the file. The wide stream buffer reads bytes from the file and uses its codecvt facet to convert those bytes to wide characters. The default codecvt facet is std::codecvt<wchar_t, char, std::mbstate_t>, which converts between the native character sets for wchar_t and char (i.e., like mbstowcs() does).
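
To see why sLine in the question starts with 255, 254, 35, 0, ..., it helps to look at the raw bytes of a UTF-16LE file. The sketch below only mimics a conversion that turns each byte into one wide character; the real behaviour of the default facet depends on the implementation and locale. The names widen_bytewise and sample_bytes are made up for this illustration.

```cpp
#include <string>
#include <vector>

// Illustration only: map every byte to one wide character, which is
// what the debugger output in the question suggests happened.
std::wstring widen_bytewise(const std::vector<unsigned char>& bytes)
{
    std::wstring out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<wchar_t>(b));
    return out;
}

// First bytes of a UTF-16LE file that starts with "#i":
// BOM (FF FE), then '#' (23 00), then 'i' (69 00).
std::vector<unsigned char> sample_bytes()
{
    return {0xFF, 0xFE, 0x23, 0x00, 0x69, 0x00};
}
```

Each 16-bit code unit ends up split across two wide characters, and the BOM shows up as the two leading characters 255 and 254.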

You're not using the native char character set, so what you want is a codecvt facet that reads UCS-2 as a multibyte sequence and converts it to wide characters.

#include <codecvt>   // deprecated since C++17, but still the simplest standard option here
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main(int argc, char *argv[])
{
    std::wifstream fin("en.rc", std::ios::binary); // You need to open the file in binary mode

    // Imbue the file stream with a codecvt facet that uses UTF-16 as the external multibyte encoding
    fin.imbue(std::locale(fin.getloc(),
              new std::codecvt_utf16<wchar_t, 0xffff, std::consume_header>));

    // ^ We set 0xFFFF as the maxcode because that's the largest that will fit in a single
    //   wchar_t on Windows, where wchar_t is 16 bits wide
    //   We use consume_header to detect and use the UTF-16 'BOM'

    // The following is not really the correct way to write Unicode output, but it's easy
    std::wstring sLine;
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    while (std::getline(fin, sLine))
    {
        std::cout << convert.to_bytes(sLine) << '\n';
    }
}
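
One side effect of binary mode is that the stream no longer translates line endings. If en.rc uses CRLF line endings (an assumption, though common for files written on Windows), every line returned by getline will end with L'\r'. A minimal sketch of a helper that removes it; strip_trailing_cr is a hypothetical name, not part of any library:

```cpp
#include <string>

// Hypothetical helper: drop one trailing carriage return, if present.
void strip_trailing_cr(std::wstring& line)
{
    if (!line.empty() && line.back() == L'\r')
        line.pop_back();
}
```

It would be called on sLine inside the read loop, before the line is searched or written out.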

Note that there's an issue with UTF-16 here. The purpose of wchar_t is for one wchar_t to represent one codepoint. However, Windows uses UTF-16, which represents some codepoints as two wchar_ts. This means that the standard API doesn't work very well with Windows.

The consequence here is that when the file contains a surrogate pair, codecvt_utf16 will read that pair, convert it to a single codepoint value that needs more than 16 bits, and then have to truncate the value to 16 bits to fit it in a wchar_t. This means this code really is limited to UCS-2. I've set the maxcode template parameter to 0xFFFF to reflect this.
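
For reference, the surrogate-pair arithmetic can be written out by hand. This is a rough sketch rather than what codecvt_utf16 does internally: it decodes UTF-16LE bytes (BOM already removed) into 32-bit code points, so nothing has to be truncated. The function name decode_utf16le and the choice to replace unpaired surrogates with U+FFFD are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch: decode UTF-16LE bytes (without BOM) into code points.
// Unpaired surrogates become U+FFFD; a trailing odd byte is ignored.
std::u32string decode_utf16le(const std::vector<unsigned char>& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    // Assemble one little-endian 16-bit code unit starting at pos.
    auto unit_at = [&](std::size_t pos) {
        return static_cast<std::uint32_t>(bytes[pos] | (bytes[pos + 1] << 8));
    };
    while (i + 1 < bytes.size()) {
        std::uint32_t u = unit_at(i);
        i += 2;
        if (u >= 0xD800 && u <= 0xDBFF) {            // high surrogate
            if (i + 1 < bytes.size()) {
                std::uint32_t lo = unit_at(i);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {  // matching low surrogate
                    i += 2;
                    out.push_back(static_cast<char32_t>(
                        0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)));
                    continue;
                }
            }
            out.push_back(U'\uFFFD');                // unpaired high surrogate
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            out.push_back(U'\uFFFD');                // stray low surrogate
        } else {
            out.push_back(static_cast<char32_t>(u)); // BMP code point
        }
    }
    return out;
}
```

For example, the bytes 3D D8 00 DE form the pair D83D DE00, which decodes to the single code point U+1F600. That value does not fit in a 16-bit wchar_t.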

There are a number of other problems with wchar_t, and you might want to just avoid it entirely: What's "wrong" with C++ wchar_t?
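
If you do avoid wchar_t, one option is to keep text as UTF-8 in a plain std::string and work with 32-bit code points only while converting. The sketch below shows the UTF-8 encoding step under that assumption. The name append_utf8 is hypothetical, and the function assumes its input is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate).

```cpp
#include <string>

// Sketch: append one code point to a UTF-8 encoded string.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}
```

With this approach the UTF-16 file is read as raw bytes, decoded to code points, and re-encoded as UTF-8, so the width of wchar_t never matters.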
