Decimal to Unicode Char in C++

Question

How do I convert a decimal number, 225 for example, to its corresponding Unicode character when it's being output? I can convert ASCII characters from decimal to the character like this:

int a = 97;
char b = a;
cout << b << endl;

That outputs the letter "a", but it just outputs a question mark when I use the number 225, or any other non-ASCII character.

Recommended answer

To start with, it's not your C++ program which converts strings of bytes written to standard output into visible characters; it's your terminal (or, more commonly these days, your terminal emulator). Unfortunately, there is no way to ask the terminal how it expects characters to be encoded, so that needs to be configured into your environment; normally, that's done by setting appropriate locale environment variables.

Like most things which have to do with terminals, the locale configuration system would probably have been done very differently if it hadn't developed with a history of many years of legacy software and hardware, most of which were originally designed without much consideration for niceties like accented letters, syllabaries or ideographs. C'est la vie.

Unicode is pretty cool, but it also had to be deployed in the face of the particular history of computer representation of writing systems, which meant making a lot of compromises in the face of the various firmly-held but radically contradictory opinions in the software engineering community, dicho sea de paso a community in which head-butting is rather more common than compromise. The fact that Unicode has eventually become more or less the standard is a testimony to its solid technical foundations and the perseverance and political skills of its promoters and designers (particularly Mark Davis), and I say this despite the fact that it basically took more than two decades to get to this point.

One of the aspects of this history of negotiation and compromise is that there is more than one way to encode a Unicode string into bits. There are at least three ways, and two of those have two different versions depending on endianness; moreover, each of these coding systems has its dedicated fans (and consequently, its dogmatic detractors). In particular, Windows made an early decision to go with a mostly-16-bit encoding, UTF-16, while most unix(-like) systems use a variable-length 8-to-32-bit encoding, UTF-8. (Technically, UTF-16 is also a 16- or 32-bit encoding, but that's beyond the scope of this rant.)
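
To make the "variable-length 8-to-32-bit" description concrete, here is a minimal sketch of how a single code point maps to UTF-8 bytes. The helper name to_utf8 is hypothetical and not part of any standard library; it is only meant to illustrate the bit layout, not to replace the locale-based approach described further down.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a standard function): encode one Unicode code
// point as UTF-8. Returns an empty string for values that are not valid
// Unicode scalar values (surrogates, or anything above U+10FFFF).
std::string to_utf8(std::uint32_t cp) {
  std::string out;
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    return out;
  }
  if (cp < 0x80) {
    // 1 byte: 0xxxxxxx (plain ASCII)
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  // Every valid scalar value encodes to between 1 and 4 bytes.
  assert(!out.empty() && out.size() <= 4);
  return out;
}
```

With this sketch, to_utf8(97) is the single byte "a", while to_utf8(225) is the two bytes 0xC3 0xA1. A lone byte with value 225 is not a valid UTF-8 sequence, which is one plausible reason a UTF-8 terminal shows a question mark for the code in the question.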

Pre-Unicode, every country/language used their own idiosyncratic 8-bit encoding (or, at least, those countries whose languages are written with an alphabet of fewer than 194 characters). Consequently, it made sense to configure the encoding as part of the general configuration of local presentation, like the names of months, the currency symbol, and what character separates the integer part of a number from its decimal fraction. Now that there is widespread (but still far from universal) convergence on Unicode, it seems odd that locales include the particular flavour of Unicode encoding, given that all flavours can represent the same Unicode strings and that the encoding is more generally specific to the particular software being used than the national idiosyncrasy. But it is, and that's why on my Ubuntu box, the environment variable LANG is set to es_ES.UTF-8 and not just es_ES. (Or es_PE, as it should be, except that I keep running into little issues with that locale.) If you're using a Linux system, you might find something similar.
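
As a small illustration of how the encoding rides along in a locale name such as es_ES.UTF-8, the sketch below pulls the codeset out of a string in the POSIX form language[_territory][.codeset][@modifier]. The function name codeset_of is hypothetical; the standard library does this parsing internally, so this is for inspection only.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a standard function): return the codeset part of
// a POSIX-style locale name, or an empty string if the name has none.
std::string codeset_of(const std::string& name) {
  const std::string::size_type dot = name.find('.');
  if (dot == std::string::npos) {
    return std::string();  // e.g. "es_ES" or "C": no codeset given
  }
  // An optional "@modifier" may follow the codeset, as in "...@euro".
  const std::string::size_type at = name.find('@', dot);
  const std::string::size_type end =
      (at == std::string::npos) ? name.size() : at;
  assert(end > dot);
  return name.substr(dot + 1, end - dot - 1);
}
```

For example, codeset_of("es_ES.UTF-8") gives "UTF-8" and codeset_of("es_ES") gives an empty string. Note that on POSIX systems LC_ALL and LC_CTYPE, when set, take precedence over LANG for character encoding, so inspecting LANG alone can be misleading.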

In theory, that means that my terminal emulator (konsole, as it happens, but there are various) expects to see UTF-8 sequences. And, indeed, konsole is clever enough to check the locale setting and set up its default encoding to match, but I'm free to change the encoding (or the locale settings), and confusion is likely to result.

So let's suppose that your locale settings and the encoding used by your terminal are actually in sync, which they should be on a well-configured workstation, and go back to the C++ program. Now, the C++ program needs to figure out which encoding it's supposed to use, and then transform from whatever internal representation it uses to the external encoding.

Fortunately, the C++ standard library should handle that correctly, if you cooperate by:

  1. Telling the standard library to use the configured locale, instead of the default C (i.e. only unaccented characters, as per English) locale; and

  2. Using strings and iostreams based on wchar_t (or some other wide character format).

If you do that, in theory you need to know neither what wchar_t means to your standard library, nor what a particular bit pattern means to your terminal emulator. So let's try that:

#include <iostream>
#include <locale>

int main(int argc, char** argv) {
  // std::locale()   is the "global" locale
  // std::locale("") is the locale configured through the locale system
  // At startup, the global locale is set to std::locale("C"), so we need
  // to change that if we want locale-aware functions to use the configured
  // locale.
  // This sets the "global" locale to the configured locale.
  std::locale::global(std::locale(""));

  // The various standard io streams were initialized before main started,
  // so they are all configured with the default global locale, std::locale("C").
  // If we want them to behave in a locale-aware manner, including using the
  // hopefully correct encoding for output, we need to "imbue" each iostream
  // with the default locale.
  // We don't have to do all of these in this simple example,
  // but it's probably a good idea.
  std::cin.imbue(std::locale());
  std::cout.imbue(std::locale());
  std::cerr.imbue(std::locale());
  std::wcin.imbue(std::locale());
  std::wcout.imbue(std::locale());
  std::wcerr.imbue(std::locale());

  // You can't write a wchar_t to cout, because cout only accepts char. wcout, on the
  // other hand, accepts both wchar_t and char; it will "widen" char. So it's
  // convenient to use wcout:
  std::wcout << "a acute: " << wchar_t(225) << std::endl;
  std::wcout << "pi:      " << wchar_t(960) << std::endl;
  return 0;
}

That works on my system. YMMV. Good luck.

Small side-note: I've run into lots of people who think that wcout automatically writes "wide characters", so that using it will produce UTF-16 or UTF-32 or something. It doesn't. It produces exactly the same encoding as cout. The difference is not what it outputs but what it accepts as input. In fact, it can't really be different from cout because both of them are connected to the same OS stream, which can only have one encoding (at a time).
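
One way to see this for yourself is to ask a locale's codecvt facet, which is the piece a wide stream uses to turn wchar_t into bytes, what it would emit for a given character. The helper narrow_bytes below is a hypothetical name, and the sketch assumes the locale's conversion for a single character fits in a small fixed buffer.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper (not a standard function): convert one wide character
// to the narrow byte sequence that the given locale's codecvt facet produces.
// Returns an empty string if the locale's encoding cannot represent it.
std::string narrow_bytes(const std::locale& loc, wchar_t wc) {
  typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
  const Facet& cvt = std::use_facet<Facet>(loc);

  std::mbstate_t state = std::mbstate_t();
  char buf[32];  // assumed to be large enough for one character
  const wchar_t* from_next = &wc;
  char* to_next = buf;
  const std::codecvt_base::result res =
      cvt.out(state, &wc, &wc + 1, from_next, buf, buf + sizeof buf, to_next);

  if (res == std::codecvt_base::noconv) {
    // The facet performs no conversion; pass the value through as one byte.
    return std::string(1, static_cast<char>(wc));
  }
  if (res != std::codecvt_base::ok) {
    return std::string();  // not representable, or the buffer was too small
  }
  assert(to_next >= buf && to_next <= buf + sizeof buf);
  return std::string(buf, to_next);
}
```

Even in the plain "C" locale, narrow_bytes(std::locale::classic(), L'a') is the single byte "a", not sizeof(wchar_t) bytes. With std::locale("") on a system configured for UTF-8, and assuming wchar_t holds Unicode code points there, wchar_t(225) would be expected to come out as two bytes.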

You might ask why it is necessary to have two different iostreams. Why couldn't cout have just accepted wchar_t and std::wstring values? I don't actually have an answer for that, but I suspect it is part of the philosophy of not paying for features you don't need. Or something like that. If you figure it out, let me know.
