Using ICU to implement my own codecvt facet

Question

I want to implement a codecvt facet using ICU to convert from any character encoding (that ICU supports) to UTF-8 internally. I'm aware that codecvt_byname exists and that it can be used to do part of what I want as shown in this example. The problems with that example are that it (1) uses wide character streams (I want to use "regular", byte-oriented streams) and (2) requires 2 streams to perform the conversion. Instead, I want a single stream like:

locale loc( locale(), new icu_codecvt( "ISO-8859-1" ) );
ifstream ifs;
ifs.imbue( loc );
ifs.open( "/path/to/some/file.txt" );
// data read from ifs here will have been converted from ISO-8859-1 to UTF-8

Hence, I want to do an implementation like this but using ICU rather than iconv. Given that, my implementation of do_in() is:

icu_codecvt::result icu_codecvt::do_in( state_type &state,
                                        extern_type const *from, extern_type const *from_end,
                                        extern_type const *&from_next, intern_type *to,
                                        intern_type *to_end, intern_type *&to_next ) const {
  from_next = from;
  to_next = to;
  if ( always_noconv_ )
    return noconv;

  our_state *const s = state_store_.get( state );
  UErrorCode err = U_ZERO_ERROR;
  ucnv_convertEx(
    s->utf8_conv_, s->extern_conv_, &to_next, to_end, &from_next, from_end,
    nullptr, nullptr, nullptr, nullptr, false, false, &err
  );
  if ( err == U_TRUNCATED_CHAR_FOUND )
    return partial;
  return U_SUCCESS( err ) ? ok : error;
}

The our_state object maintains two UConverter* pointers, one for the "external" encoding (in this example, ISO-8859-1) and one for the UTF-8 encoding.

My questions are:


  1. Should I specify nullptr for the "pivot" buffer as above, or supply my own?
  2. I'm not sure when, if ever, I should set the reset argument (currently the first false above) to true.
  3. It's not clear how I would know when to set the flush argument (currently the second false above) to true, i.e., how I know when the end of the input has been reached.

A little help?

Recommended answer

The codecvt facet is not intended to convert between different encodings. Instead, it converts from an external encoding, where one character is possibly encoded using multiple external words (typically bytes), into an internal representation where each character is represented by exactly one word (e.g. char, wchar_t, char16_t, etc.).

From this perspective it doesn't make sense to "end" an internal character sequence. If there are no more external words available the conversion is done and if the last character remained incomplete this is an error in the transfer. Thus, there is no need to indicate that the conversion is complete and, correspondingly, no interface. This should clarify that the "flush" argument indeed should always be "false".
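
To make the external-to-internal model concrete without involving ICU, here is a minimal sketch using only the standard library. The class utf8_to_latin1 and the helper decode_utf8 are hypothetical names invented for this illustration, and the facet only handles code points up to U+00FF. The point is that do_in() has no "flush" input: when the input ends in the middle of a sequence it simply returns partial, and the caller decides whether that is a failure.

```cpp
#include <cstddef>
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical facet: UTF-8 bytes outside, one Latin-1 char per character inside.
// Code points above U+00FF cannot be represented and are reported as errors.
class utf8_to_latin1 : public std::codecvt<char, char, std::mbstate_t> {
public:
  explicit utf8_to_latin1(std::size_t refs = 0)
      : std::codecvt<char, char, std::mbstate_t>(refs) {}

protected:
  result do_in(state_type&, const extern_type* from, const extern_type* from_end,
               const extern_type*& from_next, intern_type* to,
               intern_type* to_end, intern_type*& to_next) const override {
    from_next = from;
    to_next = to;
    while (from_next != from_end) {
      if (to_next == to_end) return partial;  // no room left for output
      const unsigned char lead = static_cast<unsigned char>(*from_next);
      if (lead < 0x80) {                      // ASCII: copied unchanged
        *to_next++ = static_cast<char>(lead);
        ++from_next;
      } else if (lead == 0xC2 || lead == 0xC3) {       // U+0080..U+00FF
        if (from_end - from_next < 2) return partial;  // sequence cut short
        const unsigned char trail = static_cast<unsigned char>(from_next[1]);
        if ((trail & 0xC0) != 0x80) return error;      // bad continuation byte
        *to_next++ = static_cast<char>(((lead & 0x03) << 6) | (trail & 0x3F));
        from_next += 2;
      } else {
        return error;  // malformed, or not representable in Latin-1
      }
    }
    return ok;
  }
  bool do_always_noconv() const noexcept override { return false; }
  int do_encoding() const noexcept override { return 0; }    // variable width
  int do_max_length() const noexcept override { return 2; }  // bytes per char
};

// Hypothetical helper: runs the facet once over a whole buffer and
// reports the result code, leaving the converted characters in `out`.
std::codecvt_base::result decode_utf8(const std::string& bytes, std::string& out) {
  utf8_to_latin1 cvt;
  std::mbstate_t state = std::mbstate_t();
  out.assign(bytes.size(), '\0');
  const char* from_next = nullptr;
  char* to_next = nullptr;
  const std::codecvt_base::result r =
      cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
             &out[0], &out[0] + out.size(), to_next);
  out.resize(static_cast<std::size_t>(to_next - &out[0]));
  return r;
}
```

Calling decode_utf8 on input that stops after a lead byte yields partial with only the complete characters converted, which is the situation a stream buffer would face at end of file.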

I realize that UTF-8 doesn't quite fit the bill of having one word represent one character. However, this will haunt your entire UTF-8 processing when using the standard string types. As long as you stay clear of string modifications, things typically work OK, though.
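
For the opposite direction, where the internal side is UTF-8 held in plain char, a char-to-char facet might look roughly like the sketch below. It uses only the standard library; latin1_to_utf8 and decode_latin1 are hypothetical names made up for this illustration, and ISO-8859-1 is hard-coded instead of going through ICU. Whether a given standard library's basic_filebuf copes with a char-to-char facet that produces more internal chars than it consumes external ones is an assumption to verify, not something this sketch guarantees.

```cpp
#include <cstddef>
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical facet: ISO-8859-1 bytes outside, UTF-8 bytes inside.
// One external byte becomes one or two internal chars.
class latin1_to_utf8 : public std::codecvt<char, char, std::mbstate_t> {
public:
  explicit latin1_to_utf8(std::size_t refs = 0)
      : std::codecvt<char, char, std::mbstate_t>(refs) {}

protected:
  result do_in(state_type&, const extern_type* from, const extern_type* from_end,
               const extern_type*& from_next, intern_type* to,
               intern_type* to_end, intern_type*& to_next) const override {
    from_next = from;
    to_next = to;
    while (from_next != from_end) {
      const unsigned char c = static_cast<unsigned char>(*from_next);
      const std::ptrdiff_t need = c < 0x80 ? 1 : 2;
      if (to_end - to_next < need)
        return partial;  // output full: caller drains it and calls again
      if (c < 0x80) {
        *to_next++ = static_cast<char>(c);                  // ASCII unchanged
      } else {
        *to_next++ = static_cast<char>(0xC0 | (c >> 6));    // lead byte
        *to_next++ = static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
      }
      ++from_next;
    }
    return ok;
  }
  bool do_always_noconv() const noexcept override { return false; }
  int do_encoding() const noexcept override { return 0; }  // not a fixed 1:1 mapping
};

// Hypothetical helper: converts a whole ISO-8859-1 buffer in one call.
std::string decode_latin1(const std::string& bytes) {
  latin1_to_utf8 cvt;
  std::mbstate_t state = std::mbstate_t();
  std::string out(bytes.size() * 2, '\0');  // worst case: every byte doubles
  const char* from_next = nullptr;
  char* to_next = nullptr;
  cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
         &out[0], &out[0] + out.size(), to_next);
  out.resize(static_cast<std::size_t>(to_next - &out[0]));
  return out;
}
```

Because the result is an ordinary std::string, operations that work on individual chars (indexing, erase, substr) can split a two-byte sequence, which is the kind of string modification to stay clear of.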

The "reset" parameter is probably intended to deal with seeking within a stream. I think filebuf is supposed to provide a fresh state_type object when seeking. This would probably be an indication that the ICU internals want to be reset. However, I don't know about the ICU interface. Thus, I also don't know if you want to supply a pivot buffer.
