Transcoding characters on-the-fly using iostreams and ICU
Question
I'd like to transcode character encodings on the fly. I'd like to use iostreams and my own transcoding streambuf, e.g.:
xcoder_streambuf xbuf( "UTF-8", "ISO-8859-1", cout.rdbuf() );
cout.rdbuf( &xbuf );
char *utf8_s; // pointer to buffer containing UTF-8 encoded characters
// ...
cout << utf8_s; // characters are written in ISO-8859-1
The implementation of xcoder_streambuf would use ICU's converters API. It would take the data coming in (in this case, from utf8_s), transcode it, and write it out using the iostream's original streambuf.
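The filtering idea described above can be sketched without ICU. The class below is a hypothetical stand-in (the name utf8_to_latin1_streambuf and its hand-written UTF-8 decoder are assumptions made for illustration, not the ICU-based xcoder_streambuf from the question): it decodes the bytes written to it, replaces anything outside ISO-8859-1 with '?', and forwards the result to the wrapped streambuf. It does not reject overlong encodings, which a real converter would.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical filter: accepts UTF-8, forwards ISO-8859-1 to `dest`.
// A hand-written decoder replaces the ICU converter proposed in the question.
class utf8_to_latin1_streambuf : public std::streambuf {
public:
    explicit utf8_to_latin1_streambuf(std::streambuf* dest) : dest_(dest) {
        assert(dest_ != nullptr);
    }

protected:
    // No put area is configured, so every character written lands here.
    int_type overflow(int_type c) override {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);
        const unsigned char byte =
            static_cast<unsigned char>(traits_type::to_char_type(c));
        if (pending_ > 0) {
            if ((byte & 0xC0) == 0x80) {  // continuation byte
                code_point_ = (code_point_ << 6) | (byte & 0x3Fu);
                return --pending_ > 0 ? c : emit(code_point_);
            }
            // Truncated sequence: report it, then treat `byte` as a new start.
            pending_ = 0;
            if (traits_type::eq_int_type(emit(kInvalid), traits_type::eof()))
                return traits_type::eof();
        }
        if (byte < 0x80) return emit(byte);  // ASCII passes through
        if ((byte & 0xE0) == 0xC0)      { code_point_ = byte & 0x1Fu; pending_ = 1; }
        else if ((byte & 0xF0) == 0xE0) { code_point_ = byte & 0x0Fu; pending_ = 2; }
        else if ((byte & 0xF8) == 0xF0) { code_point_ = byte & 0x07u; pending_ = 3; }
        else return emit(kInvalid);  // stray continuation or invalid lead byte
        return c;
    }

    int sync() override { return dest_->pubsync(); }

private:
    static const unsigned long kInvalid = 0xFFFD;  // mapped to '?' by emit()

    // Writes one code point as Latin-1, or '?' if it does not fit.
    int_type emit(unsigned long cp) {
        return dest_->sputc(
            cp <= 0xFF ? static_cast<char>(static_cast<unsigned char>(cp)) : '?');
    }

    std::streambuf* dest_;
    unsigned long code_point_ = 0;
    int pending_ = 0;  // continuation bytes still expected
};

// Usage sketch: route an ostream through the filter and collect the bytes.
std::string write_through_filter(const std::string& utf8) {
    std::ostringstream sink;
    utf8_to_latin1_streambuf xbuf(sink.rdbuf());
    std::ostream out(&xbuf);
    out << utf8;
    out.flush();
    return sink.str();
}
```

Working one character at a time keeps the sketch short; a production version would more likely buffer a block of input and hand it to the converter in one call.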
Is that a reasonable way to go?
Recommended answer
Is that a reasonable way to go?
Yes, but it is not the way you are expected to do it in modern (as in 1997) iostream.
The behaviour of outputting through basic_streambuf<> is defined by the overflow(int_type c) virtual function.
The semantics of basic_filebuf<>::overflow(int_type c = traits::eof()) include the call a_codecvt.out(state, b, p, end, xbuf, xbuf+XSIZE, xbuf_end); where a_codecvt is defined as:
const codecvt<charT,char,typename traits::state_type>& a_codecvt
= use_facet<codecvt<charT,char,typename traits::state_type> >(getloc());
So you are expected to imbue a locale with the appropriate codecvt<charT,char,typename traits::state_type> converter.
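As a rough illustration of the imbue route, the sketch below derives a facet from codecvt<wchar_t, char, mbstate_t> and installs it in a wofstream. The names latin1_codecvt and write_latin1_file are hypothetical, and the facet assumes each wchar_t holds one code point; characters above U+00FF make the conversion fail rather than being substituted.

```cpp
#include <cstddef>
#include <cstdio>
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical facet: wide characters inside the program, ISO-8859-1 bytes
// in the file. Stateless, exactly one external byte per internal character.
class latin1_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    result do_out(state_type&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (to_next == to_end) return partial;
            if (static_cast<unsigned long>(*from_next) > 0xFFul) return error;
            *to_next++ = static_cast<char>(static_cast<unsigned char>(*from_next++));
        }
        return ok;
    }

    result do_in(state_type&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (to_next == to_end) return partial;
            *to_next++ = static_cast<wchar_t>(static_cast<unsigned char>(*from_next++));
        }
        return ok;
    }

    // Nothing to flush: the encoding has no shift state.
    result do_unshift(state_type&, char* to, char*, char*& to_next) const override {
        to_next = to;
        return noconv;
    }

    int do_encoding() const noexcept override { return 1; }        // fixed width
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }
    int do_length(state_type&, const char* from, const char* end,
                  std::size_t max) const override {
        const std::size_t avail = static_cast<std::size_t>(end - from);
        return static_cast<int>(avail < max ? avail : max);
    }
};

// Usage sketch: write `text` through the facet, then read back the raw bytes.
std::string write_latin1_file(const std::wstring& text, const char* path) {
    {
        std::wofstream out;
        // The locale takes ownership of the facet; imbue before opening.
        out.imbue(std::locale(std::locale::classic(), new latin1_codecvt));
        out.open(path, std::ios::out | std::ios::binary);
        out << text;
    }  // closing the stream flushes the converted bytes
    std::ifstream in(path, std::ios::in | std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    in.close();
    std::remove(path);
    return bytes;
}
```

Only basic_filebuf consults the codecvt facet, which is why the sketch goes through a file instead of a string stream.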
The class codecvt<internT,externT,stateT> is for use when converting from one character encoding to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.
The standard library support for Unicode has made some progress since 1997:
the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.
This seems to be what you want (ISO-8859-1 codes are UCS-4 codes, and UCS-4 = UTF-32).
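A minimal sketch of that observation, assuming a C++11 library: decode UTF-8 to UTF-32 with the standard codecvt<char32_t, char, mbstate_t> facet, then keep every code point up to U+00FF as the Latin-1 byte with the same value. The helper name utf8_to_latin1 and the '?' substitution are assumptions made for illustration; note that C++20 deprecates this specialization, so some compilers may warn.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 -> UTF-32 via the standard facet, then narrow.
std::string utf8_to_latin1(const std::string& utf8) {
    if (utf8.empty()) return std::string();

    using facet_type = std::codecvt<char32_t, char, std::mbstate_t>;
    const facet_type& cvt =
        std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    // A code point needs at least one byte, so this buffer is large enough.
    std::u32string utf32(utf8.size(), U'\0');
    const char* from_next = nullptr;
    char32_t* to_next = nullptr;
    const std::codecvt_base::result r =
        cvt.in(state, utf8.data(), utf8.data() + utf8.size(), from_next,
               &utf32[0], &utf32[0] + utf32.size(), to_next);
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("invalid or truncated UTF-8");
    utf32.resize(static_cast<std::size_t>(to_next - &utf32[0]));

    // ISO-8859-1 codes equal the first 256 UCS-4 codes, so narrowing is a cast.
    std::string latin1;
    latin1.reserve(utf32.size());
    for (char32_t cp : utf32) {
        latin1.push_back(
            cp <= 0xFF ? static_cast<char>(static_cast<unsigned char>(cp)) : '?');
    }
    return latin1;
}
```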
If not, what would be better?
I would introduce a different type for UTF-8, like:
struct utf8 {
unsigned char d; // d for data
};
struct latin1 {
unsigned char c; // c for character
};
This way you cannot accidentally pass UTF-8 where ISO-8859-* is expected. But then you would have to write some interface code, and the type of your streams won't be istream/ostream.
Disclaimer: I have never actually done such a thing, so I don't know whether it is workable in practice.