Unknown UTF-8 code units closing double quotes
Problem description
My problem is as follows. I am reading in an XML file whose text nodes partially contain the UTF-8 version of opening and closing double quotes. The text is extracted, shortened to 3999 bytes, and put into a new XML format, which is then saved as a file.
While Notepad++ displays both characters correctly in the input file, the output file contains invalid UTF-8 sequences that not even Notepad++ is able to display.
The opening double quotes are written correctly, but the closing ones are corrupted.
Using a hex editor, I found out that the code units are somehow changed from
E2 80 9D
in the input file to
E2 80 3F
in the output file. I am using the SAX parser for the XML parsing.
Are there any known bugs that could cause such behaviour?
Recommended answer
This is not a known bug but a common mistake: the encoding is left out when reading or writing files, so the platform default encoding is used, which is Windows-1252 in this case.
When you initially read the file, specify UTF-8 decoding, and when you write the new file, specify UTF-8 encoding. If you post your implementation, I can correct it in place.
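Since the original implementation was not posted, the following is only a minimal sketch of what "specify UTF-8 on both sides" could look like with plain file I/O. The class name Utf8Copy and the helper copyAsUtf8 are hypothetical, and the sketch copies the text unchanged rather than reproducing the SAX parsing or the 3999-byte shortening from the question.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class Utf8Copy {
    // Hypothetical helper: decodes the input as UTF-8 and encodes the output as UTF-8,
    // so the platform default encoding is never involved.
    static void copyAsUtf8(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("quotes-in", ".xml");
        Path out = Files.createTempFile("quotes-out", ".xml");
        // \u201C and \u201D are the opening and closing double quotes
        Files.write(in, "<t>\u201Cquoted\u201D</t>".getBytes(StandardCharsets.UTF_8));
        copyAsUtf8(in, out);
        // Compare the raw bytes of input and output
        System.out.println(Arrays.equals(Files.readAllBytes(in), Files.readAllBytes(out)));
    }
}
```

The same idea applies wherever a String is turned into bytes or back: pass StandardCharsets.UTF_8 explicitly instead of relying on the overloads that take no charset.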
How to reproduce:
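public class ReproduceQuoteBug {
    public static void main(String[] args) throws Exception {
        // UTF-8 bytes of the closing double quote (U+201D)
        byte[] quoteutf8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D};
        // Wrongly decode and re-encode with Windows-1252, as the platform default would
        String decodedPlatformDefault = new String(quoteutf8, "Windows-1252");
        byte[] encodedPlatformDefault = decodedPlatformDefault.getBytes("Windows-1252");
        for (byte i : encodedPlatformDefault) {
            System.out.print(String.format("%02x ", i));
        }
        // prints: e2 80 3f
    }
}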
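The sketch below extends the reproduction to show why only the closing quote breaks. The likely explanation is that byte 0x9D is unassigned in Windows-1252, so Java decodes it to the replacement character U+FFFD, which is then written out as '?' (0x3F). The last byte of the opening quote, 0x9C, does map to a Windows-1252 character and therefore survives the round trip. The class QuoteRoundTrip and its methods are hypothetical names used for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuoteRoundTrip {
    // Decodes the bytes with the given charset and encodes them again with the same charset.
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    // Formats bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] opening = {(byte) 0xE2, (byte) 0x80, (byte) 0x9C}; // U+201C in UTF-8
        byte[] closing = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D}; // U+201D in UTF-8
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println(hex(roundTrip(opening, cp1252)));                  // e2 80 9c
        System.out.println(hex(roundTrip(closing, cp1252)));                  // e2 80 3f
        System.out.println(hex(roundTrip(closing, StandardCharsets.UTF_8)));  // e2 80 9d
    }
}
```

With UTF-8 used for both decoding and encoding, the closing quote keeps its original three bytes.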