C++: iterating over a UTF-8 string with characters of mixed length


Question

I need to loop over a UTF-8 string and get each character of the string. The string may contain different kinds of characters, e.g. digits that take one byte, Chinese characters that take three bytes, etc.

I looked at this post and it does about 80% of the job, except that when the string has 3-byte Chinese characters before 1-byte digits, it treats the digits as 3 bytes long too and prints them as 1**, where * is gibberish.

To give an example, if the string is '今天周五123', the result will be:





1**
2**
3**

where * is gibberish. However if the string is '123今天周五', the numbers will print out fine.

Can anyone help me here? I am new to C++, and although I checked the documentation of utf8 cpp, I still have no idea where the problem is. I think the library was created to handle exactly this case, UTF-8 encodings of different lengths, so there should be a way to do this... I have been struggling with this for two days...

Recommended answer

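Insert

memset(symbol, 0, sizeof(symbol));

before

utf8::append(code, symbol);

utf8::append writes only the bytes of the encoded code point and does not add a terminator. If symbol is reused across iterations, the trailing bytes of an earlier 3-byte character stay in the buffer and are printed after a 1-byte digit; clearing the buffer first removes them.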
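The effect of clearing the buffer can be reproduced without the library. The following is only a minimal sketch: `encode_utf8` is a hypothetical stand-in for `utf8::append` (it writes the encoded bytes and nothing else), and `reuse_buffer` is a made-up helper that encodes two code points into the same 5-byte buffer, optionally clearing it in between, and returns the buffer read as a C string.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for utf8::append: writes the UTF-8 bytes of cp into
// out and returns how many bytes were written. It does NOT null-terminate.
// Assumes cp is a valid code point and out has room for 4 bytes.
inline int encode_utf8(std::uint32_t cp, unsigned char* out)
{
    if (cp < 0x80) {
        out[0] = static_cast<unsigned char>(cp);
        return 1;
    }
    if (cp < 0x800) {
        out[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        out[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        out[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
    out[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
    out[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
    out[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    return 4;
}

// Hypothetical helper: encodes `first`, then `second`, into one shared buffer
// and returns the buffer interpreted as a null-terminated string.
inline std::string reuse_buffer(std::uint32_t first, std::uint32_t second,
                                bool clear_between)
{
    unsigned char symbol[5] = {0, 0, 0, 0, 0};
    encode_utf8(first, symbol);
    if (clear_between) std::memset(symbol, 0, sizeof(symbol));
    encode_utf8(second, symbol);  // overwrites only as many bytes as it needs
    return std::string(reinterpret_cast<const char*>(symbol));
}
```

With U+4E94 (五) followed by the digit 1, `reuse_buffer(0x4E94, '1', false)` yields three bytes, the digit plus two stale continuation bytes, which matches the 1** output in the question; with `clear_between` set to true it yields just "1".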

If this for some reason still doesn't work, or if you want to get rid of the lib, recognizing codepoints is not that complicated:

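#include <iostream>
#include <string>
using namespace std;

int main()
{
    string text = "今天周五123";
    for(size_t i = 0; i < text.length();)
    {
        // The high bits of the lead byte give the length of the sequence.
        int cplen = 1;                                 // 0xxxxxxx: one byte (ASCII)
        if((text[i] & 0xf8) == 0xf0) cplen = 4;        // 11110xxx: four bytes
        else if((text[i] & 0xf0) == 0xe0) cplen = 3;   // 1110xxxx: three bytes
        else if((text[i] & 0xe0) == 0xc0) cplen = 2;   // 110xxxxx: two bytes
        if((i + cplen) > text.length()) cplen = 1;     // sequence cut off at the end

        cout << text.substr(i, cplen) << endl;
        i += cplen;
    }
}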
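If the characters are needed for further processing rather than for printing, the same lead-byte test can be wrapped in a function. This is a sketch only; `utf8_len` and `split_utf8` are hypothetical names, and like the loop above it does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: sequence length implied by a UTF-8 lead byte.
// Continuation bytes and invalid lead bytes fall back to 1.
inline std::size_t utf8_len(unsigned char lead)
{
    if ((lead & 0xF8) == 0xF0) return 4;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xE0) == 0xC0) return 2;
    return 1;
}

// Hypothetical helper: splits text into one string per code point.
inline std::vector<std::string> split_utf8(const std::string& text)
{
    std::vector<std::string> out;
    for (std::size_t i = 0; i < text.size();) {
        std::size_t len = utf8_len(static_cast<unsigned char>(text[i]));
        if (i + len > text.size()) len = 1;  // sequence cut off at the end
        out.push_back(text.substr(i, len));
        i += len;
    }
    return out;
}
```

Calling `split_utf8` on the 3-byte character 今 followed by "12" gives three elements: the full 3-byte sequence, then "1", then "2".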

With both solutions, however, be aware that glyphs made up of multiple code points exist, as well as code points that can't be printed on their own (combining marks, for example).
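To illustrate the caveat, here is a rough sketch that keeps a combining mark attached to the code point before it. All names are hypothetical, the input is assumed to be valid UTF-8, and `is_combining` deliberately covers only the Combining Diacritical Marks block (U+0300 to U+036F); real grapheme segmentation follows the Unicode rules and is better left to a library such as ICU.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes the code point starting at text[i] and advances i past it.
// Assumes valid UTF-8; no error checking.
inline std::uint32_t next_code_point(const std::string& text, std::size_t& i)
{
    unsigned char lead = static_cast<unsigned char>(text[i++]);
    int extra = 0;
    std::uint32_t cp = lead;
    if ((lead & 0xF8) == 0xF0)      { extra = 3; cp = lead & 0x07; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    while (extra-- > 0 && i < text.size())
        cp = (cp << 6) | (static_cast<unsigned char>(text[i++]) & 0x3F);
    return cp;
}

// Simplification: only the Combining Diacritical Marks block is recognized.
inline bool is_combining(std::uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Splits text into clusters: a base code point plus any combining marks
// (as defined by is_combining) that directly follow it.
inline std::vector<std::string> split_clusters(const std::string& text)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t start = i;
        std::uint32_t cp = next_code_point(text, i);
        if (is_combining(cp) && !out.empty())
            out.back() += text.substr(start, i - start);
        else
            out.push_back(text.substr(start, i - start));
    }
    return out;
}
```

For the letter e followed by U+0301 (combining acute accent) and the digit 1, a per-code-point loop would emit three pieces, while `split_clusters` returns two: the accented e as one unit, then "1".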
