How to read a UCS-2 file?
I'm writing a program to extract information from *.rc files encoded in UCS-2 Little Endian.
int _tmain(int argc, _TCHAR* argv[]) {
    wstring csvLine(wstring sLine);
    wifstream fin("en.rc");
    wofstream fout("table.csv");
    wofstream fout_rm("temp.txt");
    wstring sLine;
    fout << "en\n";
    while (getline(fin, sLine)) {
        if (sLine.find(L"IDS") == -1)
            fout_rm << sLine << endl;
        else
            fout << csvLine(sLine);
    }
    fout << flush;
    system("pause");
    return 0;
}
The first line in "en.rc" is #include <windows.h>, but sLine shows the following in the debugger:
[0] 255 L'ÿ'
[1] 254 L'þ'
[2] 35 L'#'
[3] 0
[4] 105 L'i'
[5] 0
[6] 110 L'n'
[7] 0
[8] 99 L'c'
...
This program works correctly for UTF-8 files. How can I make it work for UCS-2?
Wide streams use a wide stream buffer to access the file. The wide stream buffer reads bytes from the file and uses its codecvt facet to convert these bytes to wide characters. The default codecvt facet is std::codecvt<wchar_t, char, std::mbstate_t>, which converts between the native character sets for wchar_t and char (i.e., like mbstowcs() does).
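Judging from the debugger output in the question, the default conversion on the asker's platform appears to be effectively byte-for-byte: each byte of the file becomes one wchar_t, so the BOM and the zero high bytes show up as separate characters. The following is a minimal sketch of that effect, not the library's actual implementation; widen_bytes and sample_bytes are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not a standard library function): widen each byte
// to one wchar_t, mimicking a byte-for-byte narrow-to-wide conversion.
std::wstring widen_bytes(const std::vector<unsigned char>& bytes)
{
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<wchar_t>(b)); // one byte in, one wide character out
    }
    assert(out.size() == bytes.size()); // nothing was combined or dropped
    return out;
}

// First bytes of a UCS-2 LE file beginning with "#in", preceded by the BOM (FF FE).
const std::vector<unsigned char> sample_bytes = {
    0xFF, 0xFE, 0x23, 0x00, 0x69, 0x00, 0x6E, 0x00
};
```

Calling widen_bytes(sample_bytes) yields the sequence 255, 254, '#', 0, 'i', 0, 'n', 0, which matches the pattern reported for sLine.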
You're not using the native char character set, so what you want is a codecvt facet that reads UCS-2 as a multibyte sequence and converts it to wide characters.
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main(int argc, char *argv[])
{
    std::wifstream fin("en.rc", std::ios::binary); // You need to open the file in binary mode

    // Imbue the file stream with a codecvt facet that uses UTF-16 as the external multibyte encoding
    fin.imbue(std::locale(fin.getloc(),
              new std::codecvt_utf16<wchar_t, 0xffff, std::consume_header>));
    // ^ We set 0xFFFF as the maxcode because that's the largest that will fit in a single wchar_t
    //   We use consume_header to detect and use the UTF-16 'BOM'

    // The following is not really the correct way to write Unicode output, but it's easy
    std::wstring sLine;
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    while (std::getline(fin, sLine))
    {
        std::cout << convert.to_bytes(sLine) << '\n';
    }
}
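Since the <codecvt> facets and std::wstring_convert are deprecated as of C++17, it may be useful to see what the facet is doing for this particular encoding. The sketch below decodes UCS-2 Little Endian by hand from bytes that have already been read into memory. It is an illustrative alternative, not the approach used above; decode_ucs2_le is a hypothetical helper, and it assumes that any BOM present is FF FE and that a trailing odd byte can be ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode UCS-2 little-endian bytes into 16-bit code units.
// A leading BOM (FF FE) is skipped; a trailing odd byte is ignored.
std::u16string decode_ucs2_le(const std::vector<unsigned char>& bytes)
{
    std::size_t i = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        i = 2; // skip the little-endian byte order mark
    }
    assert(i <= bytes.size()); // the start offset never runs past the input

    std::u16string out;
    out.reserve((bytes.size() - i) / 2);
    for (; i + 1 < bytes.size(); i += 2) {
        // Little-endian: the low byte comes first, the high byte second.
        out.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return out;
}
```

The bytes could come from a std::ifstream opened with std::ios::binary. Splitting the result into lines is then a matter of searching for u'\n', keeping in mind that Windows resource files typically end lines with "\r\n".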
Note that there's an issue with UTF-16 here. The purpose of wchar_t is for one wchar_t to represent one codepoint. However, Windows uses UTF-16, which represents some codepoints as two wchar_ts. This means that the standard API doesn't work very well with Windows.
The consequence here is that when the file contains a surrogate pair, codecvt_utf16 will read that pair, convert it to a single codepoint value greater than 16 bits and have to truncate the value to 16 bits to stick it in a wchar_t. This means this code really is limited to UCS-2. I've set the maxcode template parameter to 0xFFFF to reflect this.
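To make the 16-bit limit concrete, here is a small sketch of the standard UTF-16 surrogate arithmetic. The helper names are hypothetical and introduced only for illustration; the example codepoint U+1F600 is likewise just a convenient value above 0xFFFF.

```cpp
#include <cassert>
#include <cstdint>

// High (leading) surrogates occupy 0xD800..0xDBFF.
bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Low (trailing) surrogates occupy 0xDC00..0xDFFF.
bool is_low_surrogate(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a surrogate pair into the codepoint it encodes (always >= 0x10000).
std::uint32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}

// A codepoint fits in a single 16-bit unit only if it is at most 0xFFFF.
bool fits_in_16_bits(std::uint32_t codepoint) { return codepoint <= 0xFFFFu; }
```

For example, the pair 0xD83D 0xDE00 combines to 0x1F600, which does not fit in a 16-bit wchar_t. That is why the reading code above can only represent the UCS-2 range faithfully.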
There are a number of other problems with wchar_t, and you might want to just avoid it entirely: What's "wrong" with C++ wchar_t?