Substring of a std::string in utf-8? C++11

Question

I need to get a substring of the first N characters of a std::string that is assumed to be UTF-8. I learned the hard way that .substr does not work as expected: it counts bytes, not characters.

Reference: My strings probably look like this: mission:\n\n1億2千万匹
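
To make the problem concrete, here is a minimal sketch of what .substr does with multi-byte text. The helper name first_n_bytes is hypothetical and only wraps substr to make the byte-based behaviour explicit.

```cpp
#include <cstddef>
#include <string>

// What std::string::substr(0, n) actually does: it keeps the first n *bytes*.
// std::string has no notion of UTF-8 characters.
std::string first_n_bytes(const std::string& s, std::size_t n)
{
    return s.substr(0, n);
}
```

With the UTF-8 string "1億" (bytes 31 E5 84 84), first_n_bytes(s, 2) returns the bytes 31 E5: the digit plus only the first byte of the three-byte character, which is no longer valid UTF-8.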

Recommended answer

I found this code and am just about to try it out.

#include <cstddef>
#include <string>

// Returns `leng` characters (code points) of the UTF-8 string `str`,
// starting at character index `start`. Returns "" on invalid UTF-8.
std::string utf8_substr(const std::string& str, std::size_t start, std::size_t leng)
{
    if (leng==0) { return ""; }
    // std::size_t rather than unsigned int, so comparisons with npos are exact
    std::size_t i, ix, q, min=std::string::npos, max=std::string::npos;
    unsigned char c;
    for (q=0, i=0, ix=str.length(); i < ix; i++, q++)
    {
        if (q==start){ min=i; } // byte offset of the first wanted character
        if (q<=start+leng || leng==std::string::npos){ max=i; } // byte offset just past the last one

        c = (unsigned char) str[i];
        if      (c<=127) i+=0;               // 0bbbbbbb, 1 byte
        else if ((c & 0xE0) == 0xC0) i+=1;   // 110bbbbb, 2 bytes
        else if ((c & 0xF0) == 0xE0) i+=2;   // 1110bbbb, 3 bytes
        else if ((c & 0xF8) == 0xF0) i+=3;   // 11110bbb, 4 bytes
        // lead bytes 111110bb and 1111110b (5 and 6 bytes) are unnecessary in 4 byte UTF-8
        else return "";//invalid utf8
    }
    if (q<=start+leng || leng==std::string::npos){ max=i; }
    if (min==std::string::npos || max==std::string::npos) { return ""; }
    return str.substr(min,max-min); // the second argument of substr is a length, not an end offset
}
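
The bit masks in utf8_substr decide how many bytes to skip by looking only at the first byte of each character. As a rough illustration, the same classification can be pulled out into a small helper; the name utf8_sequence_length is hypothetical and not part of the code above.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead <= 0x7F)          return 1; // 0bbbbbbb: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110bbbbb
    if ((lead & 0xF0) == 0xE0) return 3; // 1110bbbb
    if ((lead & 0xF8) == 0xF0) return 4; // 11110bbb
    return 0; // continuation byte (10bbbbbb) or a lead byte not used in 4 byte UTF-8
}
```

For example, '1' gives 1 and 0xE5 (the first byte of 億) gives 3. This only inspects the lead byte; like utf8_substr, it does not check that the following bytes are valid continuation bytes.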

Update: This worked well for my current issue. I had to combine it with a function that returns the length of a UTF-8-encoded std::string.
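
The length function mentioned in the update is not shown. A minimal sketch of what such a helper might look like, assuming the input is valid UTF-8, follows; the name utf8_length is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts characters (code points) in a UTF-8 string by
// counting every byte that is NOT a continuation byte (10bbbbbb).
// Assumes valid UTF-8; malformed input is counted, not rejected.
std::size_t utf8_length(const std::string& str)
{
    std::size_t count = 0;
    for (unsigned char c : str)
    {
        if ((c & 0xC0) != 0x80) { ++count; }
    }
    return count;
}
```

For "1億" this returns 2, while str.length() returns 4. Note that this counts code points, so a character built from several code points (for example a base letter plus a combining accent) is counted more than once.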

This solution had some warnings spat at it by my compiler:
