Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized

Views: 155 Posted: 2016/10/25 13:30:10 Tags: c++ windows visual-c++ unicode character-encoding

Problem description

I want to convert a string encoded in a double-byte code page into a UTF-16 string using std::codecvt<wchar_t, char, std::mbstate_t>::in() on the Microsoft standard library implementation (MSVC11). For example, consider the following program:

#include <iostream>
#include <locale>

int main()
{
    // KATAKANA LETTER A (U+30A2) in Shift-JIS (Codepage 932)
    // http://msdn.microsoft.com/en-us/goglobal/cc305152
    char const cs[] = "\x83\x41";

    std::locale loc = std::locale("Japanese");

    // Output: "Japanese_Japan.932" (as expected)
    std::cout << loc.name() << '\n';

    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
    cvt_t const& codecvt = std::use_facet<cvt_t>(loc);
    wchar_t out = 0;
    std::mbstate_t mbst = std::mbstate_t();
    char const* mid;
    wchar_t* outmid;

    // Output: "2" (error) (expected: "0" (ok))
    std::cout << codecvt.in(
        mbst, cs,   cs + 2,   mid,
              &out, &out + 1, outmid) << '\n';

    // Output: "0" (expected: "30a2")
    std::cout << std::hex << out << '\n';
}

When debugging, I found that in() ends up calling the internal _Mbrtowc() function (crt\src\xmbtowc.c), passing the internal (C?) part of the std::locale, initialized with {_Page=932 _Mbcurmax=2 _Isclocale=0 ...}. The ... stands for the _Isleadbyte member, and this seems to be the problem: it is initialized to an array of 32 zeros (of type unsigned char). Thus, when the function processes the '\x83' lead byte, it checks this array and naturally comes to the (wrong) conclusion that this is not a lead byte. So it happily calls the MultiByteToWideChar() Win-API function, which, of course, fails to convert half of a character. _Mbrtowc() then returns the error code -1, which more or less cancels everything up the call stack, and ultimately 2 (std::codecvt_base::result::error) is returned.

Is this a bug in the MS standard library (it seems so)? (How) can I work around this in a portable way (i.e. with the least amount of #ifdefs)?
