Unknown character in UTF-8 encoded text


Problem description


I have a file that contains some data. The data is encoded in UTF-8 (without a BOM).

Those bytes are usually no problem to handle. Yet now there is a byte sequence in that file that I can't interpret, and I couldn't find any information about it either.

To examine the data I opened the file in a hex editor. Most of the UTF-8 sequences were perfectly normal (C3 BC for ü, C3 B6 for ö, etc.).

Yet then there was the following sequence, which I don't know how to map to the expected character:

C3 83 EF BF BF

From the context I can gather that it should represent the character ü. Yet I have no idea how you could possibly arrive at that sequence...


Here is how this looks in the file (Hex View):

54 65 73 74 20 77 69 74 68 20 63 68 61 72 20 22 
75 65 22 20 2D 3E 20 C3 BC 20 0D 0A 54 65 73 74 
20 77 69 74 68 20 63 68 61 72 20 22 6F 65 22 20 
2D 3E 20 C3 B6 0A 0D 0A 4E 6F 77 20 74 68 61 74 
20 73 74 72 61 6E 67 65 20 73 65 71 75 65 6E 63 
65 3A 20 C3 83 EF BF BF 20 69 74 20 73 68 6F 75 
6C 64 20 70 72 6F 62 61 62 6C 79 20 72 65 70 72 
65 73 65 6E 74 20 74 68 65 20 63 68 61 72 20 C3 
BC
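
To check a dump like this without a hex editor, the bytes can be decoded and the non-ASCII characters listed by code point. The following is only a minimal Python sketch: the names `HEX_DUMP` and `non_ascii` are made up for illustration, and the dump is the one shown above.

```python
# Hex dump copied from the post; whitespace is stripped before parsing.
HEX_DUMP = """
54 65 73 74 20 77 69 74 68 20 63 68 61 72 20 22
75 65 22 20 2D 3E 20 C3 BC 20 0D 0A 54 65 73 74
20 77 69 74 68 20 63 68 61 72 20 22 6F 65 22 20
2D 3E 20 C3 B6 0A 0D 0A 4E 6F 77 20 74 68 61 74
20 73 74 72 61 6E 67 65 20 73 65 71 75 65 6E 63
65 3A 20 C3 83 EF BF BF 20 69 74 20 73 68 6F 75
6C 64 20 70 72 6F 62 61 62 6C 79 20 72 65 70 72
65 73 65 6E 74 20 74 68 65 20 63 68 61 72 20 C3
BC
"""

data = bytes.fromhex("".join(HEX_DUMP.split()))
text = data.decode("utf-8")  # strict decoding; raises on malformed UTF-8

# Collect every non-ASCII character together with its code point.
non_ascii = [(ch, f"U+{ord(ch):04X}") for ch in text if ord(ch) > 0x7F]

for ch, code_point in non_ascii:
    print(code_point, repr(ch))
```

The whole dump decodes as well-formed UTF-8, so the odd sequence is not a decoding error in itself: it yields the two code points U+00C3 and U+FFFF.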










Actual text (UTF-8):

Test with char "ue" -> ü
Test with char "oe" -> ö

Now that strange sequence: Ã it should probably represent the char ü





(Well, it looks like CP won't let me display the decoded value of EF BF BF ;) )

I've highlighted the corresponding sections in the Hex View and in the text view.

Now the question:

What should C3 83 EF BF BF represent? I suppose C3 83 translates to Ã, but what is EF BF BF? The only thing I found is that if you convert the character 0xFFFF to UTF-8, EF BF BF is the byte sequence you get. But still: what exactly should it represent?
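
The observation about 0xFFFF can be verified by applying the three-byte UTF-8 bit layout by hand. This is a small Python sketch; the helper name `utf8_encode_3byte` is hypothetical and only covers the three-byte range.

```python
def utf8_encode_3byte(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 layout
    1110xxxx 10xxxxxx 10xxxxxx (surrogates are not checked in this sketch)."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("code point outside the 3-byte range")
    return bytes([
        0xE0 | (cp >> 12),           # top 4 bits
        0x80 | ((cp >> 6) & 0x3F),   # middle 6 bits
        0x80 | (cp & 0x3F),          # low 6 bits
    ])

manual = utf8_encode_3byte(0xFFFF)
builtin = "\uffff".encode("utf-8")
print(manual.hex(" ").upper(), builtin.hex(" ").upper())

# The first two bytes of the odd sequence decode to U+00C3 on their own.
first_char = bytes.fromhex("C383").decode("utf-8")
```

Both encodings give EF BF BF. Note that U+FFFF is a Unicode noncharacter, not an assigned character, and it is distinct from the replacement character U+FFFD (EF BF BD in UTF-8).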

Recommended answer

I think your sequence C3 83 EF BF BF is the result of a second UTF-8 encoding applied to the "ANSI" byte sequence C3 BC.

Let me explain:
1) When the byte C3 is treated as a single character (Ã, U+00C3) and converted to UTF-8, you get C3 83.
2) If BC is not defined in the code page, the Unicode result might be FF FF (U+FFFF).
3) Encoding that Unicode result to UTF-8 generates EF BF BF.

In conclusion:
C3 BC is converted to Unicode using a code page (I don't know which one, but not UTF-8).
This results in C3 00 FF FF, that is, U+00C3 U+FFFF stored as little-endian 16-bit values (because BC is not known in the code page used).
Then this result is encoded from Unicode to UTF-8, giving
C3 83 EF BF BF

I think the error is in the program generating your source file.
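
The steps above can be reproduced with a short Python sketch. Since the answer does not identify the actual code page, `decode_with_gappy_codepage` is a hypothetical stand-in: it maps each byte to the code point of the same value (as Latin-1 does) and turns bytes listed as undefined into U+FFFF. Both the function and the choice of U+FFFF as the "unknown" marker are assumptions taken from the answer's hypothesis, not a real codec.

```python
def decode_with_gappy_codepage(raw: bytes, undefined=frozenset({0xBC})) -> str:
    """Hypothetical single-byte decoder: byte value -> same code point,
    except bytes in `undefined`, which become U+FFFF (assumed behaviour)."""
    return "".join("\uffff" if b in undefined else chr(b) for b in raw)

original = "ü".encode("utf-8")                   # C3 BC, correct UTF-8
misread = decode_with_gappy_codepage(original)   # U+00C3 U+FFFF
as_utf16le = misread.encode("utf-16-le")         # C3 00 FF FF
double_encoded = misread.encode("utf-8")         # C3 83 EF BF BF

print(original.hex(" ").upper())
print(as_utf16le.hex(" ").upper())
print(double_encoded.hex(" ").upper())
```

Under these assumptions the output matches the sequence from the question. Because the byte BC is replaced by U+FFFF along the way, the original character cannot be recovered from the damaged bytes alone; it has to be inferred from context, as the question does.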

