For each char in string gives wrong result
Question
I have a UTF-8 encoded string. I can read it from a file and write it to another file just fine, but when I try to process the characters of that string one by one, the result is incoherent. I'm most likely doing this in a very wrong way, so what is the correct way to do it?
The content of source.txt is
afternoon_gb_1 ɑftənun
The code I wrote is
while (source >> word >> word_ipa) {
    for (char& c : word_ipa)
        myfile << word << " is " << c << endl;
}
The content of the text file myfile gets written as
afternoon_gb_1 is �
afternoon_gb_1 is �
afternoon_gb_1 is f
afternoon_gb_1 is t
afternoon_gb_1 is �
afternoon_gb_1 is �
afternoon_gb_1 is n
afternoon_gb_1 is u
afternoon_gb_1 is n
Recommended answer
In UTF-8 each code point (= logical character) is represented by one or more code units (= char); ɑftənun, in particular, is:
ch| c.p. | c.u.
--+------+-------
ɑ | 0251 | c9 91
f | 0066 | 66
t | 0074 | 74
ə | 0259 | c9 99
n | 006e | 6e
u | 0075 | 75
n | 006e | 6e
(ch = character; c.p. = code point number; c.u. = code unit representation in UTF-8; c.p. and c.u. are expressed in hexadecimal)
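The c.u. column of the table can be reproduced by dumping the bytes of the string. The following is a minimal sketch; `hex_code_units` is a hypothetical helper name, not something from the question or from any library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders every code unit (byte) of s as two
// lowercase hex digits, separated by single spaces.
std::string hex_code_units(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char b : s) {  // iterate as unsigned to avoid sign extension
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling it on the UTF-8 bytes of ɑft should yield the first three rows of the table joined together: c9 91 66 74.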
The exact details of how code points are mapped to code units are explained in many places; the basics are:
- code points up to 0x7f are mapped straight to a single code unit; for these, the high bit is never set;
- code points from 0x80 onwards are mapped to multiple code units; all the code units in a multi-code-unit sequence have the high bit set;
- if the high bit is set, the top bits have a particular meaning: in the first byte of a multibyte sequence they tell how many continuation bytes to expect; in the other bytes they unambiguously mark the byte as a continuation byte.
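The three rules above can be sketched as a classifier that looks only at the top bits of a byte. `ByteKind` and `classify` are names made up for this illustration, and the check is purely pattern-based: it does not reject lead bytes that a strict validator would refuse (overlong encodings, for instance).

```cpp
#include <cstdint>

// Hypothetical names, for illustration only.
enum class ByteKind { Single, Lead2, Lead3, Lead4, Continuation, Invalid };

// Classifies one code unit by its top bits.
ByteKind classify(std::uint8_t b) {
    if ((b >> 7) == 0x00) return ByteKind::Single;        // 0xxxxxxx
    if ((b >> 6) == 0x02) return ByteKind::Continuation;  // 10xxxxxx
    if ((b >> 5) == 0x06) return ByteKind::Lead2;         // 110xxxxx
    if ((b >> 4) == 0x0e) return ByteKind::Lead3;         // 1110xxxx
    if ((b >> 3) == 0x1e) return ByteKind::Lead4;         // 11110xxx
    return ByteKind::Invalid;                             // 11111xxx
}
```

For the string in the question, c9 classifies as the lead byte of a two-byte sequence, 91 and 99 as continuation bytes, and 66 (f) as a single-byte character.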
If you print each code unit on its own, you break the UTF-8 encoding of every code point that needs more than one code unit. For the first row, your terminal application sees
c9 0a
(the first code unit followed by a newline) and immediately detects a broken UTF-8 sequence: c9 has the high bit set, so a continuation byte should follow, but the next code unit does not have the high bit set. Hence the � character. The same holds for the second row (91 on its own, a continuation byte with nothing to continue) and for the two code units that represent ə.
Now, if you want to print out full code points (not code units), std::string won't be of any help. std::string knows nothing about this; it is essentially a glorified std::vector<char>, completely oblivious to encoding issues. All it does is store and index code units, not code points.
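To make the difference concrete, here is a small sketch that counts code points by hand and can be compared against std::string::size(), which counts code units. `count_code_points` is a hypothetical helper and assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to be valid
// UTF-8 by counting every byte that is NOT a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b >> 6) != 0x02) ++n;
    return n;
}
```

For ɑftənun, size() reports 9 (code units) while this function reports 7 (code points), which matches the table: two of the seven characters take two bytes each.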
There are, however, third-party libraries that help with this; utf8-cpp is a small but complete one. In your case, the utf8::next function would be particularly helpful:
while (source >> word >> word_ipa) {
    auto cur = word_ipa.begin();
    auto end = word_ipa.end();
    auto next = cur;
    for (; cur != end; cur = next) {
        utf8::next(next, end);  // advance next past one whole code point
        myfile << word << " is ";
        for (; cur != next; ++cur) myfile << *cur;  // emit all of its code units
        myfile << "\n";
    }
}
Here utf8::next just advances the given iterator so that it points to the code unit that starts the next code point; this code makes sure that we print together all the code units that make up a single code point.
Notice that we can reproduce its bare-bones behavior quite simply; it's just a matter of reading the UTF-8 specification (see the first table in the Wikipedia article on UTF-8):
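#include <cstddef>
#include <cstdint>
#include <iterator>
#include <stdexcept>

// Advances it by n code units, refusing to run past end.
template<typename ItT>
void safe_advance(ItT &it, size_t n, ItT end) {
    size_t d = std::distance(it, end);
    if (n > d) throw std::logic_error("Truncated UTF-8 sequence");
    std::advance(it, n);
}

// Moves it to the start of the next code point; requires it != end.
template<typename ItT>
void my_next(ItT &it, ItT end) {
    uint8_t b = *it;
    if ((b >> 7) == 0) safe_advance(it, 1, end);        // 0xxxxxxx
    else if ((b >> 5) == 6) safe_advance(it, 2, end);   // 110xxxxx
    else if ((b >> 4) == 14) safe_advance(it, 3, end);  // 1110xxxx
    else if ((b >> 3) == 30) safe_advance(it, 4, end);  // 11110xxx
    else throw std::logic_error("Invalid UTF-8 sequence");
}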
Here we are exploiting the fact that the first byte of a sequence declares how many extra code units follow to complete the code point.
(Notice that this expects valid UTF-8 and makes no attempt to resynchronize after a broken UTF-8 sequence; the library version probably fares much better in this regard.)
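The same length-based idea can also be packaged so that it returns the pieces instead of printing them. This is only a sketch under the same assumption of valid input; `sequence_length` and `split_code_points` are hypothetical names, and continuation bytes are not checked.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: number of code units announced by lead byte b.
std::size_t sequence_length(std::uint8_t b) {
    if ((b >> 7) == 0x00) return 1;  // 0xxxxxxx
    if ((b >> 5) == 0x06) return 2;  // 110xxxxx
    if ((b >> 4) == 0x0e) return 3;  // 1110xxxx
    if ((b >> 3) == 0x1e) return 4;  // 11110xxx
    throw std::logic_error("Invalid UTF-8 lead byte");
}

// Splits s into one std::string per code point, trusting the lead bytes.
std::vector<std::string> split_code_points(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t n = sequence_length(static_cast<std::uint8_t>(s[i]));
        if (n > s.size() - i) throw std::logic_error("Truncated UTF-8 sequence");
        out.push_back(s.substr(i, n));  // all code units of this code point
        i += n;
    }
    return out;
}
```

Each element of the result can then be written to the output file as a unit, so a multi-byte character such as ɑ is never torn apart.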
OTOH, it's also possible to inline just what's necessary to keep the code units of the same code point together:
while (source >> word >> word_ipa) {
    auto cur = word_ipa.begin();
    auto end = word_ipa.end();
    while (cur != end) {
        myfile << word << " is " << *cur;  // first code unit of the code point
        if (uint8_t(*cur++) >> 7 != 0) {
            // high bit set: also emit the continuation bytes (10xxxxxx)
            for (; cur != end && (uint8_t(*cur) >> 6) == 2; ++cur) myfile << *cur;
        }
        myfile << "\n";
    }
}
Here we instead completely disregard the "declared count" in the first c.u. and just check whether its high bit is set; if it is, we keep printing as long as we get code units whose top two bits are 10 (in binary, i.e. 2 in decimal), since all the continuation code units of a multi-code-unit UTF-8 sequence follow this pattern.
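As a rough sketch of the same continuation-byte approach in reusable form, the function below groups code units without printing anything. `group_by_continuation` is a hypothetical name. Because it never trusts the declared count, it does not throw on malformed input: a truncated sequence simply ends up as a short group.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: starts a new group at every byte that is not a
// continuation byte (10xxxxxx) and appends continuation bytes to the
// current group.
std::vector<std::string> group_by_continuation(const std::string& s) {
    std::vector<std::string> out;
    for (char c : s) {
        bool continuation = (static_cast<unsigned char>(c) >> 6) == 0x02;
        if (out.empty() || !continuation) out.emplace_back();
        out.back() += c;
    }
    return out;
}
```

On valid UTF-8 this should produce one group per code point; on broken input the groups are merely a best effort, so validation is still needed if correctness matters.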