Unknown UTF-8 code units closing double quotes

Views: 26 Published: 2021/9/15 19:43:18 Tags: java xml utf-8 saxparser


Question

My problem is as follows. I am reading an XML file whose text nodes partially contain the UTF-8 version of opening and closing double quotes. The text is extracted, shortened to 3999 bytes and put into a new XML format, which is then saved as a file.

While Notepad++ displays both characters correctly in the input file, the output file contains invalid UTF-8 sequences that not even Notepad++ can display.

The opening double quotes are written correctly, but the closing ones are corrupted.

Using a hex editor, I found out that the code units are somehow changed from

E2 80 9D

in the input file to

E2 80 3F

in the output file. I am using the SAX parser for the XML parsing.

Are there any known bugs that could cause such behaviour?

Recommended answer

This is not a known bug but a common mistake: the encoding is left out when reading or writing files, so the platform default encoding is used, which is Windows-1252 in this case.

When you initially read the file, you should specify UTF-8 decoding, and when writing the new file, you should specify UTF-8 encoding. If you post your implementation, I can correct it in place.
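As a minimal sketch of that advice, assuming the text passes through plain readers and writers, the charset can be named explicitly on both sides. The class `Utf8Copy` and the method `copyUtf8` are hypothetical names for this illustration, not part of the asker's code:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class Utf8Copy {
    // Hypothetical helper: copies a text file, decoding and encoding as UTF-8
    // explicitly instead of relying on the platform default charset.
    public static void copyUtf8(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("quotes-in", ".txt");
        Path out = Files.createTempFile("quotes-out", ".txt");
        // \u201C and \u201D are the opening and closing double quotes
        Files.write(in, "\u201Chello\u201D".getBytes(StandardCharsets.UTF_8));
        copyUtf8(in, out);
        // True when the bytes survive unchanged, including E2 80 9D for the closing quote
        System.out.println(Arrays.equals(Files.readAllBytes(in), Files.readAllBytes(out)));
    }
}
```

The same idea applies to any `InputStreamReader` or `OutputStreamWriter` in the pipeline: pass `StandardCharsets.UTF_8` rather than using the constructors that take no charset.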

How to reproduce:

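import java.nio.charset.Charset;

public class QuoteRepro {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // stands in for the platform default
        byte[] quoteUtf8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D}; // U+201D encoded as UTF-8
        // Wrong charset on read: 0x9D is unassigned in Windows-1252 and decodes to U+FFFD
        String decodedPlatformDefault = new String(quoteUtf8, cp1252);
        // On write, U+FFFD cannot be encoded and is replaced by '?' (0x3F)
        byte[] encodedPlatformDefault = decodedPlatformDefault.getBytes(cp1252);
        for (byte b : encodedPlatformDefault) {
            System.out.print(String.format("%02x ", b));
        }
        // prints: e2 80 3f
    }
}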
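
The same round trip also suggests why only the closing quote is damaged. The sketch below assumes the JDK's `windows-1252` charset; the class `QuoteRoundTrip` and its helpers `roundTrip` and `hex` are names made up for this illustration. The last byte of the opening quote, 0x9C, is a defined Windows-1252 character (œ) and survives decoding and re-encoding, while 0x9D is unassigned there and is lost:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuoteRoundTrip {
    // Decodes the bytes with the given charset and encodes them again.
    public static byte[] roundTrip(byte[] bytes, Charset charset) {
        return new String(bytes, charset).getBytes(charset);
    }

    // Formats bytes as lowercase hex separated by spaces.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] opening = {(byte) 0xE2, (byte) 0x80, (byte) 0x9C}; // U+201C in UTF-8
        byte[] closing = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D}; // U+201D in UTF-8

        System.out.println(hex(roundTrip(opening, cp1252)));                 // e2 80 9c
        System.out.println(hex(roundTrip(closing, cp1252)));                 // e2 80 3f
        System.out.println(hex(roundTrip(closing, StandardCharsets.UTF_8))); // e2 80 9d
    }
}
```

With UTF-8 used for both decoding and encoding, the closing quote's bytes come back unchanged.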
