Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized

Problem description

I want to convert a string encoded in a double-byte code page into a UTF-16 string using std::codecvt<wchar_t, char, std::mbstate_t>::in() on the Microsoft standard library implementation (MSVC11). For example, consider the following program:

#include <iostream>
#include <locale>

int main()
{
    // KATAKANA LETTER A (U+30A2) in Shift-JIS (Codepage 932)
    // http://msdn.microsoft.com/en-us/goglobal/cc305152
    char const cs[] = "\x83\x41";

    std::locale loc = std::locale("Japanese");

    // Output: "Japanese_Japan.932" (as expected)
    std::cout << loc.name() << '\n';

    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_t;
    cvt_t const& codecvt = std::use_facet<cvt_t>(loc);
    wchar_t out = 0;
    std::mbstate_t mbst = std::mbstate_t();
    char const* mid;
    wchar_t* outmid;

    // Output: "2" (error) (expected: "0" (ok))
    std::cout << codecvt.in(
        mbst, cs,   cs + 2,   mid,
              &out, &out + 1, outmid) << '\n';

    // Output: "0" (expected: "30a2")
    std::cout << std::hex << out << '\n';
}



When debugging, I found that in() ends up calling the internal _Mbrtowc() function (crt\src\xmbtowc.c), passing the internal (C?) part of the std::locale, initialized with {_Page=932 _Mbcurmax=2 _Isclocale=0 ...}, where ... stands for (and this seems to be the problem) the _Isleadbyte member, initialized to an array of 32 zeros (of type unsigned char). Thus, when the function processes the '\x83' lead byte, it checks this array and naturally comes to the (wrong) conclusion that it is not a lead byte. So it happily calls the MultiByteToWideChar() Win-API function, which, of course, fails to convert half a character. _Mbrtowc() then returns the error code -1, which more or less cancels everything up the call stack, and ultimately 2 (std::codecvt_base::result::error) is returned.
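To make the failure mode concrete, here is a minimal, portable sketch of a 32-byte lead-byte table. It is not the CRT's code: the names LeadByteTable, mark_lead_range, is_lead_byte and char_length are hypothetical, and the one-bit-per-byte-value layout (32 * 8 = 256 bits) is an assumption about how such a table is typically indexed. The lead-byte ranges used for code page 932 are 0x81-0x9F and 0xE0-0xFC.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 32-byte lead-byte table:
// 32 * 8 = 256 bits, one bit per possible byte value (assumed layout).
struct LeadByteTable {
    unsigned char bits[32];
};

// Mark every byte value in [lo, hi] as a lead byte.
inline void mark_lead_range(LeadByteTable& t, unsigned lo, unsigned hi) {
    assert(lo <= hi && hi <= 0xFF);
    for (unsigned c = lo; c <= hi; ++c)
        t.bits[c >> 3] |= static_cast<unsigned char>(1u << (c & 7));
}

// Test the bit that belongs to byte value c.
inline bool is_lead_byte(const LeadByteTable& t, unsigned char c) {
    return (t.bits[c >> 3] & (1u << (c & 7))) != 0;
}

// Number of input bytes a decoder would consume for the character
// that starts with byte c: 2 for a lead byte, otherwise 1.
inline std::size_t char_length(const LeadByteTable& t, unsigned char c) {
    return is_lead_byte(t, c) ? 2 : 1;
}

// All-zero table, as observed in the debugger.
inline LeadByteTable empty_table() {
    LeadByteTable t = {};
    return t;
}

// Table populated with the Shift-JIS (code page 932) lead-byte ranges.
inline LeadByteTable cp932_table() {
    LeadByteTable t = {};
    mark_lead_range(t, 0x81, 0x9F);
    mark_lead_range(t, 0xE0, 0xFC);
    return t;
}
```

With the all-zero table, char_length() reports 1 for 0x83, so a decoder built on it would pass only the first half of "\x83\x41" to the conversion routine. With the populated table it reports 2, and both bytes would be converted together.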

Is this a bug in the MS standard library (it seems so)? (How) can I work around this in a portable way (i.e. with the least amount of #ifdefs)?
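One possible workaround, sketched here under assumptions and not confirmed by the answer below, is to bypass the codecvt facet and confine the platform difference to a single helper. The name widen and its codepage parameter are hypothetical. On Windows the sketch hands the whole buffer to MultiByteToWideChar() so that lead and trail bytes stay together; elsewhere it falls back to std::mbrtowc(), which uses the current C locale and ignores the code page argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert a multibyte string to a wide string.
// 'codepage' is only honoured on Windows (e.g. 932 for Shift-JIS).
inline std::wstring widen(const std::string& in, unsigned codepage) {
    if (in.empty())
        return std::wstring();
#ifdef _WIN32
    // First call: ask for the required number of wide characters.
    int n = MultiByteToWideChar(codepage, MB_ERR_INVALID_CHARS,
                                in.data(), static_cast<int>(in.size()),
                                NULL, 0);
    if (n == 0)
        throw std::runtime_error("invalid multibyte sequence");
    std::wstring out(static_cast<std::size_t>(n), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(codepage, MB_ERR_INVALID_CHARS,
                        in.data(), static_cast<int>(in.size()),
                        &out[0], n);
    return out;
#else
    (void)codepage;  // assumption: the C locale already selects the encoding
    std::mbstate_t st = std::mbstate_t();
    std::wstring out;
    const char* p = in.data();
    const char* end = p + in.size();
    while (p != end) {
        wchar_t wc = 0;
        std::size_t r = std::mbrtowc(&wc, p,
                                     static_cast<std::size_t>(end - p), &st);
        if (r == static_cast<std::size_t>(-1) ||
            r == static_cast<std::size_t>(-2))
            throw std::runtime_error("invalid or incomplete multibyte sequence");
        if (r == 0)
            r = 1;  // an embedded null character consumes one byte
        assert(r <= static_cast<std::size_t>(end - p));
        out.push_back(wc);
        p += r;
    }
    return out;
#endif
}
```

A call such as widen("\x83\x41", 932) would be the replacement for the in() call in the program above. This keeps the #ifdef in one place, at the cost of losing the incremental, state-carrying interface that codecvt offers.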

Recommended answer

I reported it internally to Microsoft. They have now filed it as a new bug (DevDiv#737880). But I recommend also filing a Connect item at: http://connect.microsoft.com/VisualStudio
