How to detect illegal UTF-8 byte sequences and replace them in a Java InputStream?


Question

The file in question is not under my control. Most byte sequences are valid UTF-8; it is not ISO-8859-1 (or any other encoding). I want to do my best to extract as much information as possible.

The file contains a few illegal byte sequences; those should be replaced with the replacement character.

It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.

Oracle has a wrapper which does what I need: UTF8ValidationFilter (javadoc).

Is there something like that available (commercially or as free software)?

Thanks
-stephan

Solution:

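import java.io.*;
import java.nio.charset.*;

// istream is the raw InputStream being read.
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
// Substitute the replacement character for illegal byte sequences instead of throwing.
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);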


Recommended answer

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions for different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).
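To make the error actions concrete, here is a minimal sketch of a decoder configured with CodingErrorAction.REPLACE reading a stream that contains one illegal byte. The class name Utf8ReplaceDemo and the helper decodeLenient are made up for illustration; only the java.io and java.nio.charset calls are standard.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8ReplaceDemo {

    // Hypothetical helper: decodes the stream as UTF-8 and substitutes
    // U+FFFD for each illegal byte sequence instead of throwing.
    public static String decodeLenient(InputStream raw) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(raw, decoder)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'a', a lone 0xFF byte (never valid in UTF-8), then 'b'
        byte[] bytes = {'a', (byte) 0xFF, 'b'};
        String text = decodeLenient(new ByteArrayInputStream(bytes));
        // The illegal byte comes back as the replacement character.
        System.out.println(text.equals("a\uFFFDb")); // prints "true"
    }
}
```

Swapping CodingErrorAction.REPLACE for CodingErrorAction.REPORT makes the same read throw a MalformedInputException, which is useful if you only want to detect bad input rather than repair it.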

CharsetDecoder itself turns bytes into characters (a ByteBuffer into a CharBuffer) rather than writing to a stream. If you need a filtered InputStream instead of a Reader, you can re-encode the decoded characters into an OutputStream and pipe that into an InputStream using java.io.PipedOutputStream, effectively creating a filtered InputStream.
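One possible way to build such a filtered InputStream is sketched below. Because CharsetDecoder produces characters rather than bytes, the sketch re-encodes the decoded text as UTF-8 and pushes it through a pipe on a background thread. The class name Utf8FilterStream and the method sanitize are hypothetical, and the error handling is deliberately minimal.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8FilterStream {

    // Hypothetical helper: returns an InputStream that yields only valid
    // UTF-8, with illegal sequences replaced by the encoding of U+FFFD.
    public static InputStream sanitize(InputStream raw) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        PipedInputStream filtered = new PipedInputStream();
        PipedOutputStream sink = new PipedOutputStream(filtered);

        // Piped streams must be written and read on different threads.
        Thread pump = new Thread(() -> {
            try (Reader reader = new InputStreamReader(raw, decoder);
                 Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                char[] buf = new char[1024];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    writer.write(buf, 0, n);
                }
            } catch (IOException e) {
                // Sketch only: closing the writer ends the pipe, so the
                // consumer simply sees end of stream after a failure.
            }
        });
        pump.setDaemon(true);
        pump.start();
        return filtered;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {'a', (byte) 0xFF, 'b'};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = sanitize(new ByteArrayInputStream(bytes))) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        }
        // The lone 0xFF becomes EF BF BD, the UTF-8 encoding of U+FFFD.
        System.out.println(out.size()); // prints "5"
    }
}
```

If the consumer can work with characters, wrapping the stream in an InputStreamReader with the configured decoder is simpler and avoids the extra thread.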
