Convert character from UTF-8 to ISO-8859-1 manually
I have the character "ö". If I look in this UTF-8 table, I see that it has the hex value F6. If I look in the Unicode table, I see that "ö" has the indices E0 and 16. If I add both, I get F6, the hex value of the code point. This is the binary value 1111 0110.
1) How do I get from the hex value F6 to the indices E0 and 16?
2) I don't know how to get from F6 to the two bytes C3 B6...
Because I didn't get the results, I tried to go the other way. "ö" is represented in ISO-8859-1 as "Ã¶". In the UTF-8 table I can see that "Ã" has the decimal value 195 and "¶" has the decimal value 182. Converted to bits, this is 1100 0011 1011 0110.
Process:
1. Look in a table and get the Unicode code point for the character "ö". Calculated from the indices E0 and 16, you get the Unicode code point U+00F6.
2. According to the algorithm posted by wildplasser, you can calculate the encoded UTF-8 value C3 and B6.
3. In binary form you get 1100 0011 1011 0110, which corresponds to the decimal values 195 and 182.
4. If these values are interpreted as ISO 8859-1 (only 1 byte per character), then you get "Ã¶".
PS: I also found this link, which shows the values from step 2.
The pages you are using are confusing you somewhat. Neither your "UTF-8 table" nor your "Unicode table" is giving you the UTF-8 encoding of the code point. They are both simply listing the Unicode value of the characters.
In Unicode, every character ("code point") has a unique number assigned to it. The character ö is assigned the code point U+00F6, which is F6 in hexadecimal and 246 in decimal.
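As a quick check of these numbers, here is a minimal Python sketch (Python is only an illustration language here; nothing in the question depends on it). It also adds the two table indices from the question, assuming both E0 and 16 are meant as hexadecimal values.

```python
# Look up the code point of "ö" and print it in the notations used above.
ch = "\u00f6"  # the character ö

code_point = ord(ch)                    # Unicode code point as an integer
as_hex = format(code_point, "X")        # hexadecimal digits
as_binary = format(code_point, "08b")   # eight binary digits

# Assumption: the table indices E0 and 16 are both hexadecimal.
index_sum = 0xE0 + 0x16

print(code_point, as_hex, as_binary, format(index_sum, "X"))
```

Under that assumption the two indices are simply a row offset and a column offset in the table, and their sum is the code point again.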
UTF-8 is a representation of Unicode, using a sequence of between one and four bytes per Unicode code point. The transformation from 32-bit Unicode code points to UTF-8 byte sequences is described in that article - it is pretty simple to do, once you get used to it. Of course, computers do it all the time, but you can do it with a pencil and paper easily, and in your head with a bit of practice.
If you do that transformation, you will see that U+00F6 transforms to the UTF-8 sequence C3 B6, or 1100 0011 1011 0110 in binary, which is why that is the UTF-8 representation of ö.
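The pencil-and-paper steps for this particular case can be sketched in Python. The function name encode_two_byte is made up for this illustration, and it deliberately handles only the two-byte range U+0080 to U+07FF, not the full UTF-8 algorithm.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as a two-byte UTF-8 sequence."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range U+0080..U+07FF is handled")
    # Lead byte: marker bits 110, then the top 5 of the 11 payload bits.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: marker bits 10, then the low 6 payload bits.
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


encoded = encode_two_byte(0xF6)
print(" ".join(f"{byte:02X}" for byte in encoded))
```

For F6 (1111 0110), the top bits 00011 go into the lead byte (110 00011 = C3) and the low bits 110110 go into the continuation byte (10 110110 = B6).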
The other half of your question is about ISO-8859-1. This is a character encoding commonly called "Latin-1". The numeric values of the Latin-1 encoding are the same as the first 256 code points in Unicode, thus ö is F6 in Latin-1.
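A small Python sketch, using the built-in iso-8859-1 codec, can illustrate both points: Latin-1 byte values match the first 256 code points, and reading the UTF-8 bytes C3 B6 as Latin-1 produces the "Ã¶" seen in the question.

```python
ch = "\u00f6"  # the character ö

# Encoding ö as Latin-1 yields one byte whose value equals the code point.
latin1_bytes = ch.encode("iso-8859-1")

# Every byte value 0..255 decodes to the code point with the same number.
same_numbers = all(
    ord(bytes([value]).decode("iso-8859-1")) == value for value in range(256)
)

# Reading the two UTF-8 bytes of ö as Latin-1 gives two separate characters.
misread = b"\xc3\xb6".decode("iso-8859-1")

print(latin1_bytes, same_numbers, misread)
```

So "Ã¶" is not the Latin-1 representation of ö; it is what the UTF-8 bytes look like when a program wrongly treats them as Latin-1.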
Once you have converted between UTF-8 and standard Unicode code points (UTF-32), it should be trivial to get the Latin-1 encoding. However, not all UTF-8 sequences / Unicode characters have corresponding Latin-1 characters.
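As a rough sketch of that path for a single two-byte character, the following Python decodes the UTF-8 bytes to a code point by hand and then checks whether it fits in Latin-1. The function name two_byte_utf8_to_latin1 is hypothetical, and the sketch skips checks a real decoder needs (other sequence lengths, overlong encodings).

```python
def two_byte_utf8_to_latin1(data: bytes) -> bytes:
    """Convert one two-byte UTF-8 sequence to a single Latin-1 byte."""
    if len(data) != 2:
        raise ValueError("expected exactly two bytes")
    lead, continuation = data
    # Check the marker bits: 110xxxxx for the lead, 10xxxxxx for the rest.
    if lead & 0b11100000 != 0b11000000 or continuation & 0b11000000 != 0b10000000:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Strip the markers and join the 5 + 6 payload bits into the code point.
    code_point = ((lead & 0b00011111) << 6) | (continuation & 0b00111111)
    # Latin-1 covers only U+0000..U+00FF.
    if code_point > 0xFF:
        raise ValueError(f"U+{code_point:04X} has no Latin-1 equivalent")
    return bytes([code_point])


print(two_byte_utf8_to_latin1(b"\xc3\xb6"))  # the UTF-8 bytes of ö
```

A sequence such as C4 80 (U+0100) is valid UTF-8 but raises here, because that code point lies outside the 256 values Latin-1 can represent.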
See the excellent article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" for a better understanding of character encodings and transformations between them.