What characters do not directly map from Cp1252 to UTF-8?

Question

I've read in several stackoverflow answers that some characters do not directly map (or are even "unmappable") when converting from Cp1252 (aka Windows-1252; they're the same, aren't they?) to UTF-8, e.g. here: https://stackoverflow.com/a/23399926/2018047

Can someone please shed some more light on this? Does that mean that if I batch/mass convert source code from cp1252 to utf-8 I'll get some characters that will end up as garbage?

Recommended answer

This is what the Windows 1252 code page looks like.

As you can see, bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D do not have anything assigned to them.

If your input file contains those bytes, and you treat it as if it was in Windows 1252 encoding, those bytes will be treated as invalid characters. In normal circumstances, this means that the input file was not in Windows 1252.
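One way to check this yourself is a minimal Python sketch that asks Python's built-in `cp1252` codec which byte values it refuses to decode. The helper name `unassigned_cp1252_bytes` is made up for this illustration, and other runtimes may behave differently (some substitute a replacement or control character instead of raising an error).

```python
def unassigned_cp1252_bytes():
    """Return the byte values that Python's cp1252 codec cannot decode."""
    unassigned = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            # Strict decoding fails only for bytes with no assigned character.
            unassigned.append(value)
    return unassigned


print([f"0x{value:02X}" for value in unassigned_cp1252_bytes()])
# prints ['0x81', '0x8D', '0x8F', '0x90', '0x9D']
```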

All other bytes encode either printable characters or control characters, and all those characters are present in Unicode and therefore can unambiguously be encoded in UTF-8.
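As a rough illustration of that claim, the following Python sketch converts every assigned Windows 1252 byte to UTF-8 and back and checks that nothing is lost. The function name `cp1252_to_utf8` is hypothetical; `cp1252` and `utf-8` are the codec names Python ships with.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode Windows 1252 bytes as UTF-8 (raises on unassigned bytes)."""
    return data.decode("cp1252").encode("utf-8")


# The five byte values with no character assigned in Windows 1252.
UNASSIGNED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}
assigned = bytes(b for b in range(256) if b not in UNASSIGNED)

utf8 = cp1252_to_utf8(assigned)
# Going back from UTF-8 to Windows 1252 restores the original bytes exactly.
round_trip = utf8.decode("utf-8").encode("cp1252")
print(round_trip == assigned)  # prints True

# 0x80 is the euro sign in Windows 1252; in UTF-8 it takes three bytes.
print(cp1252_to_utf8(b"\x80"))  # prints b'\xe2\x82\xac'
```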

I have no idea what the linked answer is trying to claim; its last paragraph sounds like nonsense.

A few more remarks that may shed some light on what you are trying to find out:

  • UTF-8 and Windows 1252 are totally incompatible with each other outside ASCII

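  • both of those encodings will never encode text to certain byte values, different ones in each case: UTF-8 never produces 0xC0, 0xC1, or 0xF5 through 0xFF, and Windows 1252 never produces the five unassigned bytes listed above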

  • moreover, certain byte sequences are also invalid in UTF-8 (for example, a lead byte that is not followed by the continuation bytes it requires)

  • in general, if you treat a file as if it contained text encoded in UTF-8 or Windows 1252, but it doesn't, you will lose and corrupt data
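The remarks above can be seen in a few lines of Python. This is only a sketch: the sample string is arbitrary, and the variable names are chosen for this illustration.

```python
text = "café"

utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
cp1252_bytes = text.encode("cp1252")   # b'caf\xe9'

# UTF-8 bytes misread as Windows 1252: no error, but the text is garbled.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # prints cafÃ©

# Windows 1252 bytes misread as UTF-8: 0xE9 starts a multi-byte sequence
# that never completes, so strict decoding fails.
try:
    cp1252_bytes.decode("utf-8")
    misread_ok = True
except UnicodeDecodeError:
    misread_ok = False
print(misread_ok)  # prints False

# ASCII-only text is byte-for-byte identical in both encodings.
ascii_same = "plain ASCII".encode("utf-8") == "plain ASCII".encode("cp1252")
print(ascii_same)  # prints True
```

The first case is the more dangerous one in practice, because nothing fails and the corruption is only noticed when someone reads the text.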

You can select the encoding of your files in your IDE or editor. It's recommended to go UTF-8 only. You will have to convert existing Windows 1252 files.
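For a batch conversion, something along these lines might work. It is a sketch under stated assumptions, not a definitive tool: the function names and the `*.java` pattern are hypothetical, and skipping files that already decode as UTF-8 is a heuristic added here to avoid converting a file twice.

```python
from pathlib import Path


def convert_cp1252_file_to_utf8(path: Path) -> bool:
    """Rewrite one file as UTF-8; return False if it was left untouched."""
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8 (this includes pure ASCII)
    except UnicodeDecodeError:
        pass
    # Strict decoding raises UnicodeDecodeError on the unassigned bytes,
    # which usually means the file was not Windows 1252 to begin with.
    text = raw.decode("cp1252")
    # Working on bytes keeps the original line endings unchanged.
    path.write_bytes(text.encode("utf-8"))
    return True


def convert_tree(root: Path, pattern: str = "*.java") -> list:
    """Convert every matching file under root; return the files changed."""
    return [p for p in sorted(root.rglob(pattern))
            if convert_cp1252_file_to_utf8(p)]
```

Calling `convert_tree(Path("src"))` would rewrite the matching files in place, so it is worth running it on a copy or under version control first. A file that raises `UnicodeDecodeError` is left as it was and deserves a manual look.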
