Convert Unicode code points to UTF-8 and UTF-32


Question

I can't think of a way to remove the leading zeros. My goal is to loop over the code points in a for loop and create the UTF-8 and UTF-32 versions of each number.

For example, with UTF-8, wouldn't I have to remove the leading zeros? Does anyone have a solution for how to pull this off? Basically, what I am asking is: does someone have an easy way to convert Unicode code points to UTF-8?

    for (i = 0x0; i < 0xffff; i++) {
        printf("%#x \n", i);
        //convert to UTF8
    }

Here is an example of what I am trying to accomplish for each i.

  • For example: Unicode value U+0760 (base 16) would convert to UTF-8 as
    • in binary: 1101 1101 1010 0000
    • in hex: DD A0

Basically, for every i, I am trying to convert it to its hex equivalent in UTF-8.

The problem I am running into is that the process for converting Unicode to UTF-8 seems to involve removing leading 0s from the binary representation of the number. I am not sure how to do that dynamically.

Recommended answer

As the Wikipedia UTF-8 page describes, each Unicode code point (0 through 0x10FFFF) is encoded in UTF-8 as one to four bytes.

Here is a simple example function, edited from one of my earlier posts. I have now removed the U suffixes from the integer constants too. (Their intent was to remind the human programmer that the constants are explicitly unsigned for a reason: negative code points are not considered at all, and the function does assume an unsigned int code. The compiler does not care, and probably because of that the practice seems odd and confusing even to long-term members here, so I gave up and stopped including such reminders. :( )

    #include <stddef.h>  /* size_t */

    /* Encode one code point as UTF-8 into buffer (at least 4 bytes).
       Returns the number of bytes written, or 0 if code > 0x10FFFF. */
    static size_t code_to_utf8(unsigned char *const buffer, const unsigned int code)
    {
        if (code <= 0x7F) {
            buffer[0] = code;                          /* 0xxxxxxx */
            return 1;
        }
        if (code <= 0x7FF) {
            buffer[0] = 0xC0 | (code >> 6);            /* 110xxxxx */
            buffer[1] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
            return 2;
        }
        if (code <= 0xFFFF) {
            buffer[0] = 0xE0 | (code >> 12);           /* 1110xxxx */
            buffer[1] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
            buffer[2] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
            return 3;
        }
        if (code <= 0x10FFFF) {
            buffer[0] = 0xF0 | (code >> 18);           /* 11110xxx */
            buffer[1] = 0x80 | ((code >> 12) & 0x3F);  /* 10xxxxxx */
            buffer[2] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
            buffer[3] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
            return 4;
        }
        return 0;
    }

You supply it with an unsigned char array of at least four elements and the Unicode code point. The function returns the number of chars needed to encode the code point in UTF-8, which it has stored in the array. It returns 0 (nothing encoded) for codes above 0x10FFFF, but it does not otherwise check that the Unicode code point is valid. That is, it is a simple encoder, and all it knows about Unicode is that code points run from 0 to 0x10FFFF, inclusive. It knows nothing about surrogate pairs, for example.
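Because the encoder does not know about surrogates, a loop over 0 to 0xFFFF would also encode the range U+D800 to U+DFFF, which is reserved for UTF-16 surrogates and is not valid in well-formed UTF-8. A minimal sketch of a pre-check follows; the helper name `is_unicode_scalar` is hypothetical and not part of the original answer.

```c
#include <stdbool.h>

/* Hypothetical helper (not from the original answer): returns true if
   `code` is a Unicode scalar value, that is, within 0..0x10FFFF and
   outside the surrogate range 0xD800..0xDFFF. */
static bool is_unicode_scalar(const unsigned int code)
{
    if (code > 0x10FFFF)
        return false;   /* beyond the Unicode code space */
    if (code >= 0xD800 && code <= 0xDFFF)
        return false;   /* surrogate: only meaningful in UTF-16 pairs */
    return true;
}
```

A caller could skip any i for which this check fails before passing it to the encoder.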

Note that because the code point parameter is explicitly an unsigned integer, negative arguments will be converted to unsigned according to C rules.
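To illustrate that conversion, here is a small sketch. Both helper names are hypothetical; `utf8_length` only mirrors the range checks of the encoder. It assumes unsigned int is at least 32 bits wide, so that -1 converts to a value above 0x10FFFF and is rejected. With a 16-bit unsigned int, -1 would become 0xFFFF and be treated as a three-byte code point.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the encoder's range checks: returns the
   number of UTF-8 bytes needed for `code`, or 0 if it is out of range. */
static size_t utf8_length(const unsigned int code)
{
    if (code <= 0x7F)     return 1;
    if (code <= 0x7FF)    return 2;
    if (code <= 0xFFFF)   return 3;
    if (code <= 0x10FFFF) return 4;
    return 0;
}

/* A negative int is converted modulo UINT_MAX + 1, so -1 becomes UINT_MAX.
   Assumption: unsigned int has at least 32 bits, making that out of range. */
static size_t utf8_length_of_int(const int value)
{
    return utf8_length((unsigned int)value);
}
```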

You need to write a function that prints the eight least significant bits of each unsigned char (the C standard does allow larger char sizes, but UTF-8 only uses 8-bit units). Then, use the above function to convert a Unicode code point (0 to 0x10FFFF, inclusive) to its UTF-8 representation, and call your bit-printing function for each unsigned char in the array, in increasing index order, for as many unsigned chars as the conversion function returned for that code point.
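One possible sketch of that bit-printing step is shown here. The names `byte_to_bits` and `print_utf8` are hypothetical, and the encoder is repeated only so the block compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Encoder from the answer, repeated so this sketch is self-contained. */
static size_t code_to_utf8(unsigned char *const buffer, const unsigned int code)
{
    if (code <= 0x7F) {
        buffer[0] = code;
        return 1;
    }
    if (code <= 0x7FF) {
        buffer[0] = 0xC0 | (code >> 6);
        buffer[1] = 0x80 | (code & 0x3F);
        return 2;
    }
    if (code <= 0xFFFF) {
        buffer[0] = 0xE0 | (code >> 12);
        buffer[1] = 0x80 | ((code >> 6) & 0x3F);
        buffer[2] = 0x80 | (code & 0x3F);
        return 3;
    }
    if (code <= 0x10FFFF) {
        buffer[0] = 0xF0 | (code >> 18);
        buffer[1] = 0x80 | ((code >> 12) & 0x3F);
        buffer[2] = 0x80 | ((code >> 6) & 0x3F);
        buffer[3] = 0x80 | (code & 0x3F);
        return 4;
    }
    return 0;
}

/* Hypothetical helper: store the eight least significant bits of `byte`
   in `out` as '0'/'1' characters, most significant bit first. */
static void byte_to_bits(const unsigned char byte, char out[9])
{
    int i;

    assert(out != NULL);
    for (i = 0; i < 8; i++)
        out[i] = ((byte >> (7 - i)) & 1) ? '1' : '0';
    out[8] = '\0';
}

/* Hypothetical helper: print the code point, its UTF-8 bytes in hex, and
   the same bytes in binary, on one line. Returns the number of UTF-8
   bytes, or 0 (printing nothing) if the code point was not encoded. */
static size_t print_utf8(FILE *const stream, const unsigned int code)
{
    unsigned char utf8[4];
    char bits[9];
    const size_t len = code_to_utf8(utf8, code);
    size_t i;

    if (len == 0)
        return 0;

    fprintf(stream, "U+%04X:", code);
    for (i = 0; i < len; i++)
        fprintf(stream, " %02X", (unsigned int)utf8[i]);
    fputs(" =", stream);
    for (i = 0; i < len; i++) {
        byte_to_bits(utf8[i], bits);
        fprintf(stream, " %s", bits);
    }
    fputc('\n', stream);
    return len;
}
```

Calling `print_utf8(stdout, i)` inside the question's for loop would produce one line per code point. No leading zeros need to be removed explicitly: the range checks select the byte count, and the shifts and masks pick out only the bits that belong in each byte.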
