std::string character encoding


Question

std::string arrWords[10];
std::vector<std::string> hElemanlar;

......

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

......

What I am doing is this: every element of arrWords is a std::string. I take the n-th element of arrWords and push its characters, one at a time, into hElemanlar.

Assuming arrWords[0] is "test", the result is equivalent to:

this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");

My problem is that, although I have no encoding problems with arrWords, some UTF-8 characters are not printed or handled correctly once they are in hElemanlar. How can I fix this?

Recommended answer

If you know that arrWords[i] contains UTF-8 encoded text, then you probably need to split the strings into complete Unicode characters rather than into individual bytes.
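To see why byte-wise splitting breaks the text, here is a minimal sketch of what the question's code effectively does. The helper name split_bytes is hypothetical and is not part of the original code.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits a string into one-byte strings,
// mirroring the per-char push_back in the question.
std::vector<std::string> split_bytes(const std::string& word)
{
    std::vector<std::string> parts;
    for (char c : word)
        parts.push_back(std::string(1, c));  // one byte per element
    return parts;
}
```

For example, "ş" (U+015F) is encoded in UTF-8 as the two bytes 0xC5 0x9F, so split_bytes("\xC5\x9Fu") yields three elements. The first two are each half of a character and are not valid UTF-8 on their own, which is why they print incorrectly.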

As an aside, rather than saying:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

(which constructs a temporary std::string, obtains its C-string representation, constructs another temporary string from it, and pushes that onto the vector), say:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));

Anyway, to keep multi-byte characters intact, this will need to become something like:

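std::string str(1, this->arrWords[sayKelime][j]);
if (static_cast<unsigned char>(str[0]) >= 0xC0)
{
   // str[0] is the lead byte of a multi-byte sequence:
   // append every continuation byte (bit pattern 10xxxxxx) that follows it.
   while ((static_cast<unsigned char>(this->arrWords[sayKelime][j + 1]) & 0xC0) == 0x80)
   {
       str.push_back(this->arrWords[sayKelime][++j]);
   }
}
this->hElemanlar.push_back(str);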

Note that reading [j+1] in the loop above is safe: if j is the index of the last char in the string, [j+1] returns the nul terminator (guaranteed for std::string since C++11), which ends the loop. You will need to consider how incrementing j interacts with the rest of your code, though.
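As a self-contained illustration of the same idea, here is a minimal sketch that splits a whole string into code points. The names is_continuation and split_code_points are hypothetical, and the sketch assumes the input is well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: returns one string per code point.
// Assumes well-formed UTF-8 input.
std::vector<std::string> split_code_points(const std::string& text)
{
    std::vector<std::string> out;
    for (std::size_t j = 0; j < text.size(); ++j)
    {
        std::string str(1, text[j]);
        // Append the continuation bytes that belong to this lead byte.
        while (j + 1 < text.size() && is_continuation(text[j + 1]))
            str.push_back(text[++j]);
        out.push_back(str);
    }
    return out;
}
```

Here the loop owns j, so the extra increments inside the inner loop cannot interfere with other code. With this helper, split_code_points("test") yields four one-byte elements, while "\xC5\x9Fu" ("şu") yields two elements, the first holding both bytes of "ş".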

You then need to consider whether you want each element of hElemanlar to represent a single Unicode code point (which the code above does), or a character plus all the combining characters that follow it. In the latter case, you would have to extend the code above to:

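  • Parse the next code point.
  • Determine whether it is a combining character.
  • If it is, append its UTF-8 sequence to the current string.
  • Repeat (a character can be followed by more than one combining character).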
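The steps above might be sketched as follows. This is only an illustration under loud assumptions: the names decode_at, is_combining and split_clusters are hypothetical, the input is assumed to be well-formed UTF-8, and is_combining checks only the Combining Diacritical Marks block (U+0300 to U+036F). A real implementation would consult the Unicode character database, for instance through a Unicode library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes the code point starting at text[pos]; stores its byte length in len.
// Assumes well-formed UTF-8 and performs no validation.
char32_t decode_at(const std::string& text, std::size_t pos, std::size_t& len)
{
    unsigned char lead = static_cast<unsigned char>(text[pos]);
    len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    // Keep only the payload bits of the lead byte.
    char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
    for (std::size_t k = 1; k < len && pos + k < text.size(); ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(text[pos + k]) & 0x3F);
    return cp;
}

// Simplifying assumption: only U+0300..U+036F counts as combining.
bool is_combining(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Groups each base character with the combining marks that follow it.
std::vector<std::string> split_clusters(const std::string& text)
{
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < text.size())
    {
        std::size_t len = 0;
        char32_t cp = decode_at(text, pos, len);
        // A combining mark extends the previous element;
        // any other code point starts a new one.
        if (is_combining(cp) && !out.empty())
            out.back().append(text, pos, len);
        else
            out.push_back(text.substr(pos, len));
        pos += len;
    }
    return out;
}
```

For example, "e\xCC\x81st" is "e" followed by U+0301 (combining acute accent), then "s" and "t". split_clusters returns three elements, with the accent kept in the same element as the "e" it modifies.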
