std::string character encoding
Problem description
std::string arrWords[10];
std::vector<std::string> hElemanlar;
......
this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());
......
What I am doing is this: every element of arrWords is a std::string. I take the n-th element of arrWords and push its characters, one at a time, into hElemanlar.
Assuming arrWords[0] is "test", the result is equivalent to:
this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");
My problem is that although I have no encoding problems with arrWords, some UTF-8 characters are not printed or handled correctly in hElemanlar. How can I fix this?
Recommended answer
If you know that arrWords[i] contains UTF-8 encoded text, then you probably need to split the strings into complete Unicode characters rather than single bytes. Indexing a std::string yields one char (one byte), while a non-ASCII character encoded in UTF-8 occupies two to four bytes.
As an aside, rather than saying:
this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());
(which constructs a temporary std::string, obtains its C-string representation, constructs another temporary string from that, and pushes the result onto the vector), say:
this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));
Anyway, to keep multi-byte characters together, this will need to become something like:
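std::string str(1, this->arrWords[sayKelime][j]);
if (static_cast<unsigned char>(str[0]) >= 0xC0)
{
    // str[0] is the lead byte of a multi-byte sequence:
    // append the continuation bytes (10xxxxxx, i.e. 0x80..0xBF) that follow it.
    while ((static_cast<unsigned char>(this->arrWords[sayKelime][j + 1]) & 0xC0) == 0x80)
    {
        str.push_back(this->arrWords[sayKelime][++j]);
    }
}
this->hElemanlar.push_back(str);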
Note that the above loop is safe: if j is the index of the last char in the string, [j+1] returns the nul terminator, which ends the loop (for a non-const std::string this is guaranteed since C++11). You will need to consider how incrementing j interacts with the rest of your code, though.
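To sidestep the question of how j interacts with the surrounding loop, the same idea can be written as a standalone function. The following is only a sketch: splitUtf8 is a hypothetical name, not part of the original code, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original code): splits UTF-8 text into
// one std::string per code point. Assumes the input is well-formed UTF-8.
std::vector<std::string> splitUtf8(const std::string& text)
{
    std::vector<std::string> parts;
    std::size_t i = 0;
    while (i < text.size())
    {
        // Consume the first byte of the code point (ASCII or a lead byte).
        std::size_t start = i++;
        // Continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF).
        while (i < text.size() &&
               (static_cast<unsigned char>(text[i]) & 0xC0) == 0x80)
        {
            ++i;
        }
        parts.push_back(text.substr(start, i - start));
    }
    return parts;
}
```

With something like this, the body of the original loop could reduce to assigning the result of splitUtf8(arrWords[sayKelime]) to hElemanlar, or appending to it, depending on what the rest of the code expects.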
You then need to consider whether you want each element of hElemanlar to represent an individual Unicode code point (which is what the code above produces), or a character together with all the combining characters that follow it. In the latter case, you would have to extend the code above to:
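- Decode the next code point.
- Determine whether it is a combining character.
- If it is, append its UTF-8 sequence to the string.
- Repeat (a character can carry more than one combining character).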
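The steps above might look roughly like the sketch below. All three function names are hypothetical, the input is assumed to be well-formed UTF-8, and isCombining is a deliberate simplification that only recognises the Combining Diacritical Marks block (U+0300 to U+036F). Real text needs proper Unicode property data, for example from a library such as ICU.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplifying assumption: only U+0300..U+036F counts as combining.
bool isCombining(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Decodes the code point starting at text[i] and stores its byte length
// in len. Assumes well-formed UTF-8 and i < text.size().
char32_t decodeAt(const std::string& text, std::size_t i, std::size_t& len)
{
    unsigned char lead = static_cast<unsigned char>(text[i]);
    std::size_t n = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    // Keep only the payload bits of the lead byte.
    char32_t cp = n == 1 ? lead : lead & (0xFF >> (n + 1));
    if (i + n > text.size())
        n = text.size() - i;  // truncated input: stay in bounds
    for (std::size_t k = 1; k < n; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3F);
    len = n;
    return cp;
}

// Splits text into base characters, each followed by its combining marks.
std::vector<std::string> splitWithCombining(const std::string& text)
{
    std::vector<std::string> parts;
    std::size_t i = 0;
    while (i < text.size())
    {
        std::size_t len = 0;
        decodeAt(text, i, len);  // base character: only its length matters
        std::size_t end = i + len;
        while (end < text.size())
        {
            std::size_t nextLen = 0;
            if (!isCombining(decodeAt(text, end, nextLen)))
                break;
            end += nextLen;  // keep the combining mark with its base
        }
        parts.push_back(text.substr(i, end - i));
        i = end;
    }
    return parts;
}
```

Even with full combining-character data this is not complete grapheme segmentation as Unicode defines it, so treat it as an illustration of the four steps rather than a finished solution.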