Trouble with std::codecvt_utf8 facet

Question

Here is a snippet of code that uses the std::codecvt_utf8<> facet to convert from wchar_t to UTF-8. With Visual Studio 2012, my expectations are not met (see the condition at the end of the code). Are my expectations wrong, and if so, why? Or is this a Visual Studio 2012 library issue?

#include <locale>
#include <codecvt>
#include <cstdlib>
#include <cwchar>   // std::mbstate_t

int main ()
{
    std::mbstate_t state = std::mbstate_t ();
    std::locale loc (std::locale (), new std::codecvt_utf8<wchar_t>);
    typedef std::codecvt<wchar_t, char, std::mbstate_t> codecvt_type;
    codecvt_type const & cvt = std::use_facet<codecvt_type> (loc);

    wchar_t ch = L'\u5FC3';
    wchar_t const * from_first = &ch;
    wchar_t const * from_mid = &ch;
    wchar_t const * from_end = from_first + 1;

    char out_buf[1];
    char * out_first = out_buf;
    char * out_mid = out_buf;
    char * out_end = out_buf + 1;

    std::codecvt_base::result cvt_res
        = cvt.out (state, from_first, from_end, from_mid,
            out_first, out_end, out_mid);

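    // This is what I expect:
    // (comparing state with 0 compiles only where mbstate_t is an
    // integer type; the type is opaque, so this is not portable)
    if (cvt_res == std::codecvt_base::partial
        && out_mid == out_end
        && state != 0)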
        ;
    else
        abort ();
}

The expectation here is that the out() function outputs one byte of the UTF-8 conversion at a time, but the middle part of the if condition above is false with Visual Studio 2012.

What fails are the out_mid == out_end and state != 0 conditions. Basically, I expect at least one byte to be produced, and the state needed to produce the next byte of the UTF-8 sequence to be stored in the state variable.

Answer

The standard's description of the partial return code of codecvt::do_out, in Table 83, says exactly this:

partial not all source characters converted

In 22.4.1.4.2 [locale.codecvt.virtuals]/5:


Returns: An enumeration value, as summarized in Table 83. A return value of partial, if (from_next==from_end), indicates that either the destination sequence has not absorbed all the available destination elements, or that additional source elements are needed before another destination element can be produced.

In your case, not all (zero) source characters were converted, which technically says nothing of the contents of the output sequence (the 'if' clause in the sentence is not entered), but speaking generally, "the destination sequence has not absorbed all the available destination elements" here talks about valid multibyte characters. They are the elements of the multibyte character sequence produced by codecvt_utf8.
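As a practical consequence, it seems safest to give out() room for at least one complete multibyte character rather than a single byte. The sketch below is not from the original post: wide_char_to_utf8 is a hypothetical helper name, and it assumes a toolchain that still ships <codecvt> (deprecated since C++17). It sizes the output buffer with max_length(), so a single wide character should always fit.

```cpp
#include <cassert>
#include <codecvt>   // deprecated since C++17; assumed to be available here
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original question or answer):
// converts one wide character to UTF-8, giving out() enough room for a
// whole multibyte character so that partial is not expected.
std::string wide_char_to_utf8 (wchar_t ch)
{
    std::codecvt_utf8<wchar_t> cvt;
    std::mbstate_t state = std::mbstate_t ();

    // max_length() is the largest number of bytes that one wide
    // character can need in the external encoding.
    int const max_len = cvt.max_length ();
    assert (max_len > 0);
    std::vector<char> buf (static_cast<std::size_t> (max_len));

    wchar_t const * from_next = &ch;
    char * to_next = buf.data ();
    std::codecvt_base::result const res
        = cvt.out (state, &ch, &ch + 1, from_next,
            buf.data (), buf.data () + buf.size (), to_next);

    // Anything other than ok is reported as an empty string in this sketch.
    if (res != std::codecvt_base::ok)
        return std::string ();

    return std::string (buf.data (), to_next);
}
```

Calling wide_char_to_utf8 (L'\u5FC3') with the character from the question should yield the three bytes E5 BF 83. Converting a longer string works the same way, as long as the free space in the output buffer never drops below max_length() bytes before a call to out().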

It would be nice to have a more explicit standard wording, but here are two circumstantial pieces of evidence:

One: the old C wide-to-multibyte conversion function std::wcsrtombs (whose locale-specific variants are usually called by the existing implementations of codecvt::do_out for system-supplied locales) is defined as follows:


Conversion stops [...] when the next multibyte character would exceed the limit of len total bytes to be stored into the array pointed to by dst.

Two: look at the existing implementations of codecvt_utf8. You've already explored Microsoft's, and here's what's in libc++: codecvt_utf8::do_out calls ucs2_to_utf8 on Windows and ucs4_to_utf8 on other systems, and ucs2_to_utf8 does the following (comments mine):

        else if (wc < 0x0800)
        {
            // not relevant
        }
        else // if (wc <= 0xFFFF)
        {
            if (to_end-to_nxt < 3)
                return codecvt_base::partial; // <- look here
            *to_nxt++ = static_cast<uint8_t>(0xE0 |  (wc >> 12));
            *to_nxt++ = static_cast<uint8_t>(0x80 | ((wc & 0x0FC0) >> 6));
            *to_nxt++ = static_cast<uint8_t>(0x80 |  (wc & 0x003F));
        }

Nothing is written to the output sequence if it cannot fit the whole multibyte character that results from consuming one input wide character.
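To make the all-or-nothing behaviour concrete, here is a minimal self-contained sketch of the same idea. It is not libc++'s code: encode_result and encode_utf8_all_or_nothing are hypothetical names chosen for illustration. The function checks the remaining space first and reports partial without touching the output when the whole UTF-8 sequence would not fit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type, mirroring the relevant codecvt_base::result values.
enum class encode_result { ok, partial, error };

// Hypothetical helper: encodes one code point as UTF-8. Either the whole
// sequence is written and to_next is advanced past it, or nothing is written.
encode_result encode_utf8_all_or_nothing (std::uint32_t cp,
    unsigned char * & to_next, unsigned char * to_end)
{
    assert (to_next <= to_end);

    // Reject values that are not Unicode scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return encode_result::error;

    std::ptrdiff_t const needed
        = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;

    // The space check comes before any write, so a short buffer is
    // left untouched and the caller can retry with more room.
    if (to_end - to_next < needed)
        return encode_result::partial;

    if (needed == 1)
    {
        *to_next++ = static_cast<unsigned char> (cp);
    }
    else if (needed == 2)
    {
        *to_next++ = static_cast<unsigned char> (0xC0 | (cp >> 6));
        *to_next++ = static_cast<unsigned char> (0x80 | (cp & 0x3F));
    }
    else if (needed == 3)
    {
        *to_next++ = static_cast<unsigned char> (0xE0 | (cp >> 12));
        *to_next++ = static_cast<unsigned char> (0x80 | ((cp >> 6) & 0x3F));
        *to_next++ = static_cast<unsigned char> (0x80 | (cp & 0x3F));
    }
    else
    {
        *to_next++ = static_cast<unsigned char> (0xF0 | (cp >> 18));
        *to_next++ = static_cast<unsigned char> (0x80 | ((cp >> 12) & 0x3F));
        *to_next++ = static_cast<unsigned char> (0x80 | ((cp >> 6) & 0x3F));
        *to_next++ = static_cast<unsigned char> (0x80 | (cp & 0x3F));
    }
    return encode_result::ok;
}
```

With the one-byte buffer from the question, encoding U+5FC3 this way returns partial and leaves the output pointer where it was, which matches what the question observed with Visual Studio 2012. With three bytes of room the same call writes E5 BF 83.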
