Unknown character in UTF-8 encoded text
Question
I have a file which contains some data. That data is encoded in UTF-8 (without a BOM).
Those bytes are usually no problem to handle. Yet now in that file there is a byte sequence that I don't know how to interpret (and I couldn't find any information about it either).
To examine the data I opened the file in a hex editor. There were UTF-8 byte sequences which were pretty normal (C3 BC for ü, C3 B6 for ö, etc.).
Yet then there was the following sequence, which I don't know how to map to the expected char:
C3 83 EF BF BF
From the context I can gather that it should represent the character ü. Yet I have no idea how you could possibly arrive at that sequence...
Example of how this looks in the file (Hex View):
54 65 73 74 20 77 69 74 68 20 63 68 61 72 20 22
75 65 22 20 2D 3E 20 C3 BC 20 0D 0A 54 65 73 74
20 77 69 74 68 20 63 68 61 72 20 22 6F 65 22 20
2D 3E 20 C3 B6 0A 0D 0A 4E 6F 77 20 74 68 61 74
20 73 74 72 61 6E 67 65 20 73 65 71 75 65 6E 63
65 3A 20 C3 83 EF BF BF 20 69 74 20 73 68 6F 75
6C 64 20 70 72 6F 62 61 62 6C 79 20 72 65 70 72
65 73 65 6E 74 20 74 68 65 20 63 68 61 72 20 C3
BC
Actual text (UTF-8):
Test with char "ue" -> ü
Test with char "oe" -> ö
Now that strange sequence: Ã it should probably represent the char ü
(Well, looks like CP won't let me display the decoded value of EF BF BF ;) )
I've highlighted the corresponding sections in the Hex View and in the text view.
Now the question:
What should C3 83 EF BF BF represent? I suppose C3 83 translates okay to Ã, but what is EF BF BF? The only thing I found is that if you convert the char 0xFFFF to UTF-8, EF BF BF is the byte sequence that you get. But still: what exactly should it represent?
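The byte-level observation in the question can be checked directly. The following is a minimal Python sketch (Python is an assumption here, the original post names no language) that decodes the five bytes as UTF-8 and lists the resulting code points:

```python
# Decode the puzzling byte sequence as UTF-8 and list its code points.
data = bytes.fromhex("C3 83 EF BF BF")
text = data.decode("utf-8")

code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+00C3', 'U+FFFF']

# The reverse direction: U+FFFF encodes to EF BF BF in UTF-8.
ffff_bytes = "\uffff".encode("utf-8")
print(ffff_bytes == bytes.fromhex("EF BF BF"))  # True
```

So the sequence is two characters: U+00C3 (Ã) followed by U+FFFF, which Unicode defines as a noncharacter and which therefore has no visible representation.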
Recommended answer
I think your sequence C3 83 EF BF BF is the result of a second UTF-8 encoding applied to the bytes C3 BC after they were read as "ANSI" text.
Let me explain:
1) When the byte C3 is read as an ANSI char (Ã, U+00C3) and converted to UTF-8, you get C3 83.
2) If BC is not defined in the code page, the Unicode result might be FF FF (U+FFFF).
3) Encoding that Unicode result to UTF-8 generates EF BF BF.
In conclusion:
C3 BC is converted to Unicode using a code page (I don't know which one, but not UTF-8).
This results in C3 00 FF FF (the UTF-16 little-endian bytes for U+00C3 and U+FFFF), because BC is not known in the code page used.
That result is then encoded from Unicode to UTF-8, giving
C3 83 EF BF BF
I think the error is in the program that generated your source file.
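The chain described above can be reproduced with a small Python sketch. The `misdecode` helper is hypothetical: it stands in for whatever code page the faulty program used, mapping each byte to the code point of the same value and any byte listed as undefined to U+FFFF. Which real code page behaves this way is not known from the post.

```python
def misdecode(data, undefined):
    """Hypothetical single-byte decoder: one byte per character,
    bytes in `undefined` become U+FFFF (assumed behaviour)."""
    return "".join("\uffff" if b in undefined else chr(b) for b in data)

def hex_view(data):
    """Format bytes the way a hex editor shows them."""
    return " ".join(f"{b:02X}" for b in data)

original = "ü".encode("utf-8")                 # correct UTF-8: C3 BC
wrong_text = misdecode(original, undefined={0xBC})

utf16 = wrong_text.encode("utf-16-le")         # intermediate Unicode form
reencoded = wrong_text.encode("utf-8")         # second UTF-8 encoding

print(hex_view(original))   # C3 BC
print(hex_view(utf16))      # C3 00 FF FF
print(hex_view(reencoded))  # C3 83 EF BF BF
```

Under that assumption the output matches the bytes found in the file, which supports the double-encoding explanation. Note that the substitution is lossy: every undefined byte collapses to U+FFFF, so the original BC cannot be recovered from the bytes alone.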