How to correctly count æ ø å (Unicode as UTF-8) characters in C?

Question

I am writing a simple program that counts the characters of a text file (UTF-8) that I have read into a linked list. Everything seems to work well, except that it counts æ ø å (the last three letters of the Norwegian alphabet) twice for each occurrence. So if the string is æøå, I get 6 instead of 3. How can I fix this?

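#include <stdio.h>

// Note: pointer, root and the node type are defined elsewhere in the program.
int length()
{
  pointer = root; // Reset pointer to the first node
  int i; // Index into the data of the current node
  int len = 0; // Running count (this counts bytes, not characters)
  int sizedata = sizeof(pointer->data); // Size limit for data in a node (assumes data is an array)

  while(pointer != NULL)
    {
      for(i = 0; i < sizedata; i++) // Loop through the data in this node
        {
          if(pointer->data[i] == '\0') break; // Stop at the end of the string
          len++; // Every byte is counted, so æ, ø and å add 2 each
        }
      pointer = pointer->next; // Move on to the next node
    }
  printf("Length of text is: %d characters\n", len);
  return len; // The function is declared int, so return the count
}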
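
To make the doubled count concrete, here is a small sketch that prints the raw bytes of a string. The helper name `dump_bytes` is made up for this illustration and is not part of the program above; the string is written with hex escapes so the example does not depend on the encoding of the source file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print every byte of a NUL-terminated string
   in hex and return the number of bytes (same value as strlen). */
size_t dump_bytes(const char *s)
{
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        printf("%02x ", (unsigned char)s[i]);
    printf("(%zu bytes)\n", n);
    return n;
}
```

Calling `dump_bytes("\xc3\xa6\xc3\xb8\xc3\xa5")`, which is æøå in UTF-8, prints `c3 a6 c3 b8 c3 a5 (6 bytes)`. Each of the three letters occupies two bytes, which is why a loop that counts bytes reports 6.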

Recommended answer

Note (thanks @Eljay): This is counting Unicode code points (that are encoded in UTF-8), not characters as a reader sees them (glyphs). Some characters are made up of multiple code points. For example, x̝̌ is encoded as the bytes 78 cc 9d cc 8c: one byte for the x and two bytes for each of the two combining code points. This routine would count that 1 character as a length of 3 (code points).
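
The routine the note refers to is not reproduced on this page. As a minimal sketch of one common way to count code points, and not necessarily the answerer's exact code: in UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so counting every byte that is not a continuation byte counts code points. The function name `utf8_code_points` is hypothetical, and the sketch assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: count the code points in a NUL-terminated,
   valid UTF-8 string. A byte whose top two bits are 10 is a
   continuation byte; every other byte starts a new code point. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++; /* lead byte or plain ASCII byte */
    }
    return count;
}
```

With this check, `utf8_code_points("\xc3\xa6\xc3\xb8\xc3\xa5")` (æøå) gives 3, while `utf8_code_points("x\xcc\x9d\xcc\x8c")` (x̝̌) also gives 3, matching the note above: code points are counted, not visible characters. The same test could be applied to `pointer->data[i]` inside the loop from the question, so that `len` is incremented only for non-continuation bytes.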
