How to correctly count æ ø å (Unicode as UTF-8) characters in C?
Problem description
I am writing a simple program that counts the characters of a text file (UTF-8) that I have read into a linked list. Everything seems to work well, except that it counts æ, ø and å (the last three letters of the Norwegian alphabet) twice for each instance. So if the string is æøå, I get 6 instead of 3. How can I fix this?
int length()
{
    pointer = root;                         // Reset pointer to the first node
    int i;                                  // Index into the data of a node
    int len = 0;                            // Running count (this counts bytes, not characters)
    int sizedata = sizeof(pointer->data);   // Size limit for data in a node (data must be an array)
    while(pointer != NULL)
    {
        for(i = 0; i < sizedata; i++)       // Looping through data in node
        {
            if(pointer->data[i] == '\0') break; // Stops count on end of string
            len++;                          // Counts every byte, so æ, ø and å add 2 each
        }
        pointer = pointer->next;            // Linking to next node
    }
    printf("Length of text is: %d characters\n", len);
    return len;                             // The function is declared int, so return the count
}
Recommended answer
Note (thanks @Eljay): This is counting Unicode code points (that are encoded in UTF-8), but not characters (glyphs). Some characters are made up of multiple code points. For example, x̝̌ is encoded as the bytes 78 cc 9d cc 8c: the x followed by two combining code points of two bytes each. This routine would count that 1 character as a length of 3 (code points).
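The code point counting described in the note can be sketched as follows. This is a minimal sketch of one common approach, not necessarily the exact code of the original answer, and it assumes the text is valid UTF-8. In UTF-8 every byte after the first byte of a multi-byte sequence has the bit pattern 10xxxxxx, so counting only the bytes that do not match that pattern yields the number of code points. The function name utf8_codepoints is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: counts the code points in a NUL-terminated
 * string, assuming the string holds valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Continuation bytes have the form 10xxxxxx. Every other byte
         * (ASCII, or the lead byte of a multi-byte sequence) starts a
         * new code point. The cast avoids sign issues with plain char. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For æøå (bytes c3 a6 c3 b8 c3 a5) this returns 3 rather than 6, and for the x̝̌ example it returns 3, matching the note. In the question's length() function, the same bit test on pointer->data[i] could guard the len++ statement instead of incrementing for every byte.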