Unicode literal - how does this even make sense?

Question

#include <iostream>

int main() {
    std::cout << "\u2654" << std::endl; // Result #1: ♔
    std::cout << U'\u2654' << std::endl; // Result #2: 9812
    std::cout << U'♔' << std::endl; // Result #3: 9812
    return 0;
}

I am having trouble understanding how Unicode works with C++. Why do the character literals print a number in the terminal instead of the character itself?

I would like something like this to work:

char32_t txt_representation() { return /* Unicode codepoint */; } 

Note: the source file is UTF-8 and so is the terminal. This is on macOS Sierra, using CLion.

Recommended answer

Unicode and C++

There are several Unicode encodings:

  • UTF-8 encodes each Unicode character into a sequence of one to four 8-bit bytes (char).
  • UTF-16 (which can be BE or LE depending on endianness) encodes each Unicode character into a sequence of one or two 16-bit words (char16_t).
  • UTF-32 (again BE or LE) encodes each Unicode character into one 32-bit word (char32_t).

Here is an excellent video tutorial on Unicode with C++ by James McNellis. He explains character set encodings, Unicode and its different encodings, and how to use them in C++.

Your code

"\u2654" is a narrow string literal, whose type is array of const char. The white chess king Unicode character is encoded as 3 consecutive chars corresponding to its UTF-8 encoding ({ 0xe2, 0x99, 0x94 }). Because this is a string, holding several chars is no problem. As your console almost certainly uses UTF-8, it decodes the sequence correctly when the string is displayed.

U'\u2654' is a character literal of type char32_t (because of the uppercase U prefix). As it is a char32_t and not a char, the stream displays it as an integer value rather than as a character. The value in decimal is 9812. Had you printed it in hexadecimal (2654), you would have recognized it immediately.

The last one, U'♔', obeys the same logic. Be aware, however, that you are embedding a Unicode character directly in the source code. This is fine as long as the editor's character encoding matches the source encoding the compiler expects. But it could cause mismatches if the file were copied (without conversion) to an environment that expects a different encoding.
