Strange UTF-8 decoding error in Windows Notepad

Question

If you type the following string into a text file encoded as UTF-8 (without a BOM) and open it with notepad.exe, you get some weird characters on screen. Yet Notepad decodes the string correctly if the last 'a' is removed. Very strange behavior. I am using Windows 10 1809.

[19, 16, 12, 14, 15, 15, 12, 17, 18, 15, 14, 15, 19, 13, 20, 18, 16, 19, 14, 16, 20, 16, 18, 12, 13, 14, 15, 20, 19, 17, 14, 17, 18, 16, 13, 12, 17, 14, 16, 13, 13, 12, 15, 20, 19, 15, 19, 13, 18, 19, 17, 14, 17, 18, 12, 15, 18, 12, 19, 15, 12, 19, 18, 12, 17, 20, 14, 16, 17, 18, 15, 12, 13, 19, 18, 17, 18, 14, 19, 18, 16, 15, 18, 17, 15, 15, 19, 16, 15, 14, 19, 13, 19, 15, 17, 16, 12, 12, 18, 12, 14, 12, 16, 19, 12, 19, 12, 17, 19, 20, 19, 17, 19, 20, 16, 19, 16, 19, 16, 12, 12, 18, 19, 17, 18, 16, 12, 17, 13, 18, 20, 19, 18, 20, 14, 16, 13, 12, 12, 14, 13, 19, 17, 20, 18, 15, 12, 15, 20, 14, 16, 15, 16, 19, 20, 20, 12, 17, 13, 20, 16, 20, 13a

I wonder whether this is a Windows bug or whether there is something I can do to solve it.

Recommended answer

Did more research; figured it out.

This seems to be a variation of the classic "Bush hid the facts" case. https://en.wikipedia.org/wiki/Bush_hid_the_facts

It looks like Notepad has a different character encoding default for saving a file than it does for opening a file. Yes, this does seem like a bug.

But there is an actual explanation for what is occurring:

  1. Notepad checks for a BOM byte sequence. If it does not find one, it has two options: the encoding is either UTF-16 Little Endian (without a BOM) or plain ASCII. It checks for UTF-16 LE first, using a function called IsTextUnicode.
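The BOM check in the step above can be sketched in a few lines. This is only an illustration of what "checking for a BOM byte sequence" means, not Notepad's actual code; the names `BOMS` and `sniff_bom` are hypothetical, and the three BOM values come from Python's standard `codecs` module.

```python
import codecs

# Known byte-order marks and the Python codec that matches each one.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: the caller has to guess, which is where the trouble starts

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"[19, 16, 12"))
```

A file saved as UTF-8 without a BOM falls into the `None` branch, so the encoding has to be guessed from the content.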

IsTextUnicode runs a series of tests to guess whether the given text is Unicode. One of these tests is IS_TEXT_UNICODE_STATISTICS, which uses statistical analysis. If that test passes, the text is probably Unicode, but this is not guaranteed.
https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-istextunicode
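To see what IsTextUnicode says about a particular byte sequence, it can be called directly through `ctypes`. The following is a minimal sketch that assumes a Windows machine; on any other platform the hypothetical helper `looks_like_utf16` returns `None`. The flag value 0x0002 for IS_TEXT_UNICODE_STATISTICS is taken from the Windows SDK headers, and the function is exported by Advapi32.dll. Whether Notepad passes the same flags is not known, so the result is only indicative.

```python
import ctypes
import sys

IS_TEXT_UNICODE_STATISTICS = 0x0002  # flag value from the Windows SDK headers

def looks_like_utf16(data: bytes):
    """Ask IsTextUnicode about data; return None when not running on Windows."""
    if sys.platform != "win32":
        return None
    is_text_unicode = ctypes.windll.advapi32.IsTextUnicode
    is_text_unicode.argtypes = [
        ctypes.c_char_p,               # buffer to examine
        ctypes.c_int,                  # size of the buffer in bytes
        ctypes.POINTER(ctypes.c_int),  # in: tests to run, out: tests passed
    ]
    is_text_unicode.restype = ctypes.c_int
    flags = ctypes.c_int(IS_TEXT_UNICODE_STATISTICS)
    passed = is_text_unicode(data, len(data), ctypes.byref(flags))
    return bool(passed), flags.value

# Assumption: this short prefix of the question's string is only a placeholder;
# pass the full file contents to reproduce the behaviour.
print(looks_like_utf16(b"[19, 16, 12, 14"))
```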

If IsTextUnicode returns true, Notepad decodes the file as UTF-16 LE, producing the strange output you saw. We can confirm this with the character ㄠ. The ASCII characters it came from are ' 1' (space, one), whose byte values are 0x20 for the space and 0x31 for the one. Because the byte order is Little Endian, the second byte becomes the high byte, so the two bytes are read as the single code unit 0x3120, that is U+3120, which you can confirm by looking up that code point.
https://unicode-table.com/en/3120/
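The byte arithmetic above is easy to check in Python. This sketch decodes the two ASCII bytes for " 1" as UTF-16 LE and then applies the same misreading to a short prefix of the question's string. The helper name `misdecode` is hypothetical, and the prefix assumes the string begins with "[" as it appears on this page.

```python
# Bytes as stored on disk for the two ASCII characters " 1" (space, then one).
raw = b" 1"

# UTF-16 LE treats the first byte as the low half and the second as the high
# half, so 0x20 0x31 becomes the single code unit 0x3120.
misread = raw.decode("utf-16-le")
code_point = ord(misread)

def misdecode(ascii_text: str) -> str:
    """Return what ASCII text looks like when its bytes are read as UTF-16 LE."""
    data = ascii_text.encode("ascii")
    if len(data) % 2:
        data = data[:-1]  # a trailing odd byte cannot form a 16-bit code unit
    return data.decode("utf-16-le", errors="replace")

print(f"U+{code_point:04X}", misread)
print(misdecode("[19, 16, 12"))
```

Every ", 1" in the file contributes a space followed by "1", which is why ㄠ keeps recurring in the garbled output.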

If you want to work around the issue, you need to break the pattern that leads IsTextUnicode to conclude the text is Unicode. One way is to insert a newline before the text.
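If the file is generated by a script, the workaround can be applied when writing it. The sketch below is one possible way to do that; the helper name `write_for_notepad`, its parameters, and the choice of a Windows-style "\r\n" for the leading newline are assumptions. The `use_bom` option is an alternative not mentioned in the answer: it relies on the BOM check running before any guessing, and it changes the file from "UTF-8 without BOM" to "UTF-8 with BOM", which may not suit every consumer of the file.

```python
import os
import tempfile

def write_for_notepad(path: str, text: str, use_bom: bool = False,
                      newline: str = "\r\n") -> None:
    """Write text as UTF-8, prefixed by a newline or (optionally) by a BOM."""
    if use_bom:
        # "utf-8-sig" makes Python emit the UTF-8 BOM (EF BB BF) first.
        encoding, body = "utf-8-sig", text
    else:
        # Leading newline, as suggested in the answer, to break the pattern.
        encoding, body = "utf-8", newline + text
    # newline="" stops Python from translating line endings on its own.
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(body)

# Usage: write a sample file into a temporary directory and inspect its start.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "sample.txt")
write_for_notepad(sample_path, "[19, 16, 12")
with open(sample_path, "rb") as handle:
    print(handle.read()[:4])
```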

Hope that helps!
