Printing multi-byte characters in terminal using C

Views: 201 Published: 2020/7/13 6:19:55 Tags: c encoding utf-8 terminal

Question

I've been experimenting with a custom string object (struct) which looks like this:

typedef struct
{
    int encoding;
    int length;
    character * array;
} EncodedString;

The idea is that by specifying the encoding, I can make a few functions which use that encoding to print the string correctly, i.e. ASCII or utf-8 or utf-16, etc. (Excuse my character encoding ignorance.)

Right now, I'm trying to print out one (Mandarin) Chinese character: 狗 (0x72d7). I thought perhaps by printing it character by character, it would work properly, but obviously not. It printed just "r?" (0x72 and 0xd7, respectively). So how can I amend this program so that it prints the character?

#include <stdio.h>

typedef unsigned char character;

typedef struct
{
    int encoding;
    int length;
    character * array;
} EncodedString;

void printString(EncodedString str);

int main(void)
{

    character doginmandarin[] = {0x72U, 0xd7U};
    EncodedString mystring = {0, sizeof doginmandarin, doginmandarin};

    printString(mystring);
    printf("\n");

    return 0;

}

void printString(EncodedString str) // <--- where I try to print the character
{
    int i;
    for(i = 0; i < str.length; i++)
    {
        printf("%c", str.array[i]);
    }
}

Ideally, I would prefer that the array containing the characters hold only unsigned chars, which means separating the two bytes that make up the character 狗. Although it's not serving any purpose now, the idea is to use the encoding field of the EncodedString struct to determine how many bytes each character is.

How can this be implemented with the least amount of hacks?

Recommended answer

The number 0x72d7 is the Unicode code point (an abstract number, written U+72D7) for the character you want to print. When represented in memory as the two bytes 0x72, 0xd7, it becomes the UCS-2 code for that character, which also happens to be its UTF-16 encoding (in big-endian byte order). But your terminal is probably expecting UTF-8 encoded characters. The correct UTF-8 encoding for the code point 0x72d7 is 0xe7, 0x8b, 0x97.
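As a minimal sketch of that fix, the struct from the question can simply hold the three UTF-8 bytes instead of the two UCS-2 bytes. The names dog_utf8 and writeString are hypothetical (they are not in the original program), and the sketch assumes the terminal really does expect UTF-8.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned char character;

typedef struct
{
    int encoding;       /* still unused, as in the question */
    int length;         /* number of bytes, not number of characters */
    character * array;
} EncodedString;

/* UTF-8 encoding of U+72D7, the three bytes given in the answer. */
static character dog_utf8[] = {0xe7U, 0x8bU, 0x97U};

/* Hypothetical replacement for printString: writes the raw bytes of str
   to out in one call and returns how many bytes were written. */
size_t writeString(EncodedString str, FILE *out)
{
    assert(out != NULL && str.length >= 0);
    return fwrite(str.array, 1, (size_t)str.length, out);
}
```

In main, something like EncodedString mystring = {0, sizeof dog_utf8, dog_utf8}; followed by writeString(mystring, stdout); should then show 狗 on a UTF-8 terminal. Writing the bytes with fwrite rather than one printf("%c") per byte is only a design choice; both send the same bytes.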

You could fix your code to use UTF-8 encoded characters, but this encoding is impractical as an in-memory representation because it produces a different number of bytes for different characters. This makes simple string operations, like getting the nth character, very complicated. Instead, fixed-length representations are often used. For example, UCS-2 always uses two bytes per character (though it can only represent code points up to U+FFFF). The conversion to the external encoding is then done as late as possible, just before printing the strings.
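The late-conversion idea can be sketched as follows: the string is stored as an array of 16-bit UCS-2 values, so the nth character is just text[n], and UTF-8 is produced only at output time. The function names ucs2_to_utf8 and print_ucs2 are hypothetical, and the sketch assumes the output stream expects UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Encodes one UCS-2 value (U+0000..U+FFFF) as UTF-8 into out.
   Returns the number of bytes produced (1 to 3).
   Surrogate values (0xD800..0xDFFF) are not characters; this sketch
   does not reject them. */
size_t ucs2_to_utf8(uint16_t cp, unsigned char out[3])
{
    assert(out != NULL);
    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Prints count UCS-2 characters to stream. The conversion to UTF-8
   happens only here. Returns the total number of bytes written. */
size_t print_ucs2(const uint16_t *text, size_t count, FILE *stream)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned char buf[3];
        size_t n = ucs2_to_utf8(text[i], buf);
        total += fwrite(buf, 1, n, stream);
    }
    return total;
}
```

With uint16_t text[] = {0x72d7}; a call to print_ucs2(text, 1, stdout) sends the bytes 0xe7, 0x8b, 0x97 to the terminal.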

Edit (in response to comments)

UTF-8 is a tricky encoding. The mapping from code points to UTF-8 bytes is not trivial and involves some bit manipulation. It is a prefix code, somewhat like a Huffman code: the leading bits of the first byte tell how many bytes the character will occupy. All the following (continuation) bytes start with the bits 10, which makes it possible to detect malformed UTF-8. It's described here: http://en.wikipedia.org/wiki/UTF-8#Description
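The prefix and continuation-byte rules can be illustrated with a small structural check. This is only a sketch: utf8_sequence_length and utf8_count are hypothetical names, and the check looks only at the bit patterns described above (it does not reject overlong forms or surrogates, which a full validator would).

```c
#include <assert.h>
#include <stddef.h>

/* Returns the sequence length (1 to 4) announced by a lead byte,
   or 0 if the byte cannot start a sequence. */
size_t utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx or invalid */
}

/* Counts the characters in a UTF-8 buffer.
   Returns (size_t)-1 if the byte structure is malformed. */
size_t utf8_count(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    size_t count = 0;
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_sequence_length(bytes[i]);
        if (n == 0 || n > len - i)
            return (size_t)-1;            /* bad lead byte or truncated */
        for (size_t k = 1; k < n; k++) {
            if ((bytes[i + k] & 0xC0) != 0x80)
                return (size_t)-1;        /* continuation must be 10xxxxxx */
        }
        i += n;
        count++;
    }
    return count;
}
```

Applied to the bytes from the question, 0x72 is a complete one-byte character ('r'), while 0xd7 announces a two-byte sequence that never arrives, so the input is reported as malformed. That is consistent with the "r?" the terminal displayed.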

To find the three bytes quickly for this post, I just typed this in a Python console: u"\u72d7".encode('UTF-8')
