What is std::wifstream::getline doing to my wchar_t array? It's treated like a byte array after getline returns

Problem description

I want to read lines of Unicode text (UTF-16 LE, line feed delimited) from a file. I'm using Visual Studio 2012 and targeting a 32-bit console application.

I was not able to find a ReadLine function within WinAPI so I turned to Google. It is clear I am not the first to seek such a function. The most commonly recommended solution involves using std::wifstream.

I wrote code similar to the following:

wchar_t buffer[1024];
std::wifstream input(L"input.txt");

// Test the result of getline itself so a failed read is never processed.
while (input.getline(buffer, 1024))
{
    // ... do stuff...
}

input.close();

For the sake of explanation, assume that input.txt contains two UTF-16 LE lines which are less than 200 wchar_t chars in length.

Prior to calling getline the first time, Visual Studio correctly identifies that buffer is an array of wchar_t. You can mouse over the variable in the debugger and see that the array is made up of 16-bit values. However, after the call to getline returns, the debugger displays buffer as if it were a byte array.

After the first call to getline, the contents of buffer are correct (aside from buffer being treated like a byte array). If the first line of input.txt contains the UTF-16 string L"123", this is correctly stored in buffer as (hex) "31 00 32 00 33 00".

My first thought was to reinterpret_cast<wchar_t *>(buffer), which does produce the desired result (buffer is now treated like a wchar_t array), and it contains the values I expect.

However, after the second call to getline (the second line of input.txt contains the string L"456"), buffer contains (hex) "00 34 00 35 00 36 00". Note that this is incorrect (it should be [hex] "34 00 35 00 36 00").

The fact that the byte ordering gets messed up prevents me from using reinterpret_cast as a workaround. More importantly, why is std::wifstream::getline converting my wchar_t buffer into a char buffer at all? I was under the impression that if one wanted to use chars they would use ifstream, and if they wanted to use wchar_t they would use wifstream...

I am terrible at making sense of the STL headers, but it almost looks as if wifstream is intentionally converting my wchar_t to a char... why??

I would appreciate any insights and explanations for understanding these problems.

Solution

wifstream reads bytes from the file and converts them to wide characters using the codecvt facet installed in the stream's locale. The default facet assumes the system-default code page and calls mbstowcs on those bytes.
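To make the symptom in the question concrete, here is a rough simulation of what a byte-by-byte conversion does to UTF-16 LE input. It assumes the default facet maps each ASCII byte to exactly one wchar_t, which is what a single-byte code page would do. The helper name split_bytewise is made up for illustration and is not part of any library.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: widen every byte to one wide character, the way a
// single-byte code page conversion would, then split the result on L'\n'.
std::vector<std::wstring> split_bytewise(const std::string& bytes)
{
    std::vector<std::wstring> lines;
    std::wstring current;
    for (unsigned char b : bytes) {
        const wchar_t wc = static_cast<wchar_t>(b);
        if (wc == L'\n') {
            // The line ends at the 0x0A byte; the 0x00 byte that follows it
            // in UTF-16 LE becomes the first character of the next line.
            lines.push_back(current);
            current.clear();
        } else {
            current.push_back(wc);
        }
    }
    if (!current.empty()) {
        lines.push_back(current);
    }
    return lines;
}
```

Fed the bytes of L"123\n456\n" in UTF-16 LE, the first line comes out as the six wide characters 31 00 32 00 33 00 and the second as the seven wide characters 00 34 00 35 00 36 00. Under this assumption the buffer was never turned into a char buffer: each wchar_t holds a single byte of the file, and the stray leading 00 is the high byte of the preceding line feed.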

To treat your file as UTF-16, you need to use codecvt_utf16. Like this:

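#include <codecvt>
#include <fstream>
#include <locale>

std::wifstream fin("text.txt", std::ios::binary);
// apply facet (the locale takes ownership of the facet object)
fin.imbue(std::locale(fin.getloc(),
          new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));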
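Putting the facet together with a read loop, a minimal sketch might look like the following. The function name read_utf16le_lines is hypothetical, and the sketch assumes the file is UTF-16 LE, line feed delimited, and has no byte order mark, as described in the question.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): read every line of a
// UTF-16 LE, LF-delimited file into wide strings.
std::vector<std::wstring> read_utf16le_lines(const char* path)
{
    // Binary mode keeps the runtime from translating line endings on raw bytes.
    std::wifstream in(path, std::ios::binary);

    // Install the UTF-16 LE facet before the first read.
    // The locale takes ownership of the facet and deletes it.
    using utf16le_facet =
        std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>;
    in.imbue(std::locale(in.getloc(), new utf16le_facet));

    std::vector<std::wstring> lines;
    std::wstring line;
    // std::getline grows the string as needed, so no fixed-size buffer.
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

If the file does start with a byte order mark, the third template argument can be written as std::codecvt_mode(std::little_endian | std::consume_header) so that the mark is skipped. Because the stream is opened in binary mode, a file with CR LF line endings would leave a trailing L'\r' on each line. Note also that the <codecvt> header is deprecated as of C++17, so newer compilers may emit warnings for this code.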
