Repairing wrong encoding in XML files


Question

One of our providers sometimes sends XML feeds that are declared as UTF-8 encoded documents but contain bytes that are not valid UTF-8. This causes the parser to throw an exception and stop building the DOM object as soon as it reaches those bytes:

DocumentBuilder.parse(ByteArrayInputStream bais) 

throws the following exception:

org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.

Is there a way to "capture" these problems early and avoid the exception (i.e. finding and removing those characters from the stream)? What I'm looking for is a "best effort" type of fallback for wrongly encoded documents. The correct solution would obviously be to attack the problem at the source and make sure that only correct documents are delivered, but what is a good approach when that is not possible?

Recommended answer

If the problem truly is the wrong encoding (as opposed to a mixed encoding), you don't need to re-encode the document to parse it. Just hand it to the parser as a Reader instead of an InputStream. The Reader has already decoded the bytes into characters, so the DOM parser ignores the encoding named in the XML declaration:

DocumentBuilder.parse(new InputSource(new InputStreamReader(inputStream, "<real encoding>")));
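
To make the one-liner above concrete, here is a minimal sketch of a "best effort" fallback built on the same idea. It assumes the whole feed is already in memory as a byte array, and that ISO-8859-1 is the provider's real encoding. That choice is only a guess for illustration, as are the class and method names (LenientXmlParser, isValidUtf8, parse). The sketch first checks whether the bytes decode cleanly as UTF-8, and only when they do not does it parse through a Reader using the fallback charset.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; the name and structure are illustrative only.
public class LenientXmlParser {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Parses the document as UTF-8 when the bytes are valid UTF-8, otherwise
     * with the given fallback charset. Because the parser receives a Reader,
     * it does not apply the encoding named in the XML declaration.
     */
    static Document parse(byte[] data, Charset fallback) throws Exception {
        Charset charset = isValidUtf8(data) ? StandardCharsets.UTF_8 : fallback;
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader =
                new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            return builder.parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // A feed that claims UTF-8 but is really ISO-8859-1 (assumed example data).
        byte[] data =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>Caf\u00e9</name>"
                        .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(data, StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Note that ISO-8859-1 maps every possible byte to some character, so this fallback never fails, but it produces the wrong characters if the real encoding is something else. For a feed with mixed encodings this check is not enough, which is the caveat the answer makes up front.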
