Getting the string length on UTF-8 in C?

Question

Can this be done using a method similar to this one:

As long as the current element of the string the user input via scanf is not '\0', add one to the "length" int and then print out the length.

I would be very grateful if anybody could guide me through the least complex way possible, as I am a beginner.

Thank you very much, have a good one!

Recommended answer

What do you mean by string length?

The number of bytes is easily obtained with strlen(s).
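The loop described in the question (add one for every element that is not '\0') can be sketched as below. The name count_bytes is a hypothetical helper, not part of the original answer. It should give the same result as strlen(s), which means it counts bytes, not characters.

```c
#include <stddef.h>

/* Hypothetical helper: the question's loop, written out by hand.
   It counts every byte before the terminating '\0', exactly like strlen. */
size_t count_bytes(const char *s) {
    size_t length = 0;
    while (s[length] != '\0') {
        length++;
    }
    return length;
}
```

For example, count_bytes("h\xC3\xA9llo") returns 6 even though the text "héllo" has five characters, because "é" takes two bytes (0xC3 0xA9) in UTF-8.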

The number of code points encoded in UTF-8 can be computed by counting the single-byte characters (range 1 to 127) and the leading bytes of multi-byte sequences (range 0xC0 to 0xFF), ignoring continuation bytes (range 0x80 to 0xBF) and stopping at '\0'. In well-formed UTF-8 only 0xC2 to 0xF4 actually occur as leading bytes.

Here is a simple function to do this:

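#include <stddef.h>  /* size_t */

/* Count UTF-8 code points: every byte except continuation bytes (10xxxxxx). */
size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        /* (byte & 0xC0) == 0x80 marks a continuation byte; skip those. */
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}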

This function assumes that the contents of the array pointed to by s are properly encoded.
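If the input cannot be trusted to be properly encoded, one option is a stricter counter. The following is only a sketch: count_utf8_code_points_checked is a hypothetical name, and returning (size_t)-1 on error is an assumed convention, neither of which comes from the answer. It only checks that each leading byte is followed by the expected number of continuation bytes. It does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>

/* Hypothetical helper (not from the answer): counts code points, but
   returns (size_t)-1 when a leading byte is invalid or is not followed
   by the expected number of continuation bytes.
   Overlong forms and surrogates are NOT detected. */
size_t count_utf8_code_points_checked(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p) {
        size_t extra;  /* continuation bytes expected after this byte */

        if (*p < 0x80) {
            extra = 0;                      /* single-byte (ASCII) */
        } else if (*p >= 0xC2 && *p <= 0xDF) {
            extra = 1;                      /* 2-byte sequence */
        } else if (*p >= 0xE0 && *p <= 0xEF) {
            extra = 2;                      /* 3-byte sequence */
        } else if (*p >= 0xF0 && *p <= 0xF4) {
            extra = 3;                      /* 4-byte sequence */
        } else {
            return (size_t)-1;              /* stray continuation or invalid lead */
        }
        p++;

        while (extra--) {
            /* A '\0' here also fails the test, so we never read past the end. */
            if ((*p & 0xC0) != 0x80) {
                return (size_t)-1;
            }
            p++;
        }
        count++;
    }
    return count;
}
```

With this sketch, "h\xC3\xA9llo" yields 5, while a truncated sequence such as "\xC3" or a lone continuation byte such as "\x80" yields (size_t)-1.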

Also note that this will compute the number of code points, not the number of characters displayed, as some of these may be encoded using multiple combining code points, such as <LATIN CAPITAL LETTER A> followed by <COMBINING ACUTE ACCENT>.
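To make the difference concrete, here is a small sketch that compares two encodings of the same visible character. The function count_utf8_code_points is the one from the answer, and print_counts is a hypothetical helper added only for illustration. The byte sequences are written as hex escapes so the example does not depend on the source file's encoding.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same counting function as in the answer. */
size_t count_utf8_code_points(const char *s) {
    size_t count = 0;
    while (*s) {
        count += (*s++ & 0xC0) != 0x80;
    }
    return count;
}

/* Hypothetical helper: prints the byte count and code point count of s. */
void print_counts(const char *label, const char *s) {
    printf("%s: %zu bytes, %zu code points\n",
           label, strlen(s), count_utf8_code_points(s));
}

/* Two ways to encode what is displayed as a single "A with acute accent":
   - precomposed U+00C1                      -> bytes C3 81
   - 'A' (U+0041) + combining acute U+0301   -> bytes 41 CC 81 */
static const char precomposed[] = "\xC3\x81";
static const char combining[]   = "A\xCC\x81";
```

Calling print_counts on precomposed reports 2 bytes and 1 code point, while combining reports 3 bytes and 2 code points, although both are normally rendered as one character. Counting displayed characters would require grapheme cluster segmentation, which this sketch does not attempt.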
