How can I decode a large, multi-byte string file progressively in Java?


Question

I have a program that may need to process large files, possibly containing multi-byte encodings. My current code for doing this has the problem that it creates a memory structure to hold the entire file, which can cause an out-of-memory error if the file is large:

Charset charset = Charset.forName( "UTF-8" );
CharsetDecoder decoder = charset.newDecoder();
FileInputStream fis = new FileInputStream( file );
FileChannel fc = fis.getChannel();
int lenFile = (int)fc.size();  // note: truncates for files larger than 2 GB
MappedByteBuffer bufferFile = fc.map( FileChannel.MapMode.READ_ONLY, 0, lenFile );
CharBuffer cb = decoder.decode( bufferFile );  // decodes the whole file into memory at once
// process character buffer
fc.close();

The problem is that if I chop up the file byte contents using a smaller buffer and feed it piecemeal to the decoder, then the buffer could end in the middle of a multi-byte sequence. How should I cope with this problem?
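
One way to cope, as a sketch: drive a CharsetDecoder by hand and, after each decode() call, compact() the byte buffer so that any trailing partial multi-byte sequence is carried over into the next read. The class and method names below (ProgressiveDecode, decodeAll) are invented for illustration; the tiny 4-byte chunk size in main is chosen deliberately so chunk boundaries fall inside multi-byte sequences:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ProgressiveDecode {
    // Decode a whole channel chunk by chunk, never holding more than
    // bufSize bytes (plus bufSize chars) in memory at once.
    static String decodeAll(ReadableByteChannel ch, int bufSize) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(bufSize);
        // UTF-8 never produces more than one char per input byte,
        // so a char buffer of the same capacity cannot overflow here.
        CharBuffer chars = CharBuffer.allocate(bufSize);
        StringBuilder out = new StringBuilder();
        boolean eof = false;
        while (!eof) {
            eof = ch.read(bytes) == -1;
            bytes.flip();
            CoderResult cr = decoder.decode(bytes, chars, eof);
            if (cr.isError()) cr.throwException();
            chars.flip();
            out.append(chars);   // "process" the decoded chars
            chars.clear();
            bytes.compact();     // keep any trailing partial multi-byte sequence
        }
        CoderResult cr = decoder.flush(chars);  // drain decoder-internal state
        if (cr.isError()) cr.throwException();
        chars.flip();
        out.append(chars);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        String text = "héllo ☃ ünïcode";  // contains 2- and 3-byte sequences
        Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
        try (ReadableByteChannel ch = Files.newByteChannel(tmp)) {
            System.out.println(decodeAll(ch, 4).equals(text)); // prints true
        } finally {
            Files.delete(tmp);
        }
    }
}
```

The key point is that decode(..., false) stops with an underflow when the input ends mid-sequence, leaving those bytes in the buffer; compact() moves them to the front so the next read appends after them.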

Answer

Use a Reader instead

A CharsetDecoder is indeed the underlying mechanism which allows the decoding of bytes into chars. In short, you could say that:

// Extrapolation...
byte stream --> decoding       --> char stream
InputStream --> CharsetDecoder --> Reader

The lesser-known fact is that most (but not all... see below) default decoders in the JDK (such as the one created by a FileReader, or by an InputStreamReader given only a charset) will have a policy of CodingErrorAction.REPLACE. The effect is to replace any invalid byte sequence in the input with the Unicode replacement character (yes, that infamous �).
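
A minimal sketch showing that silent replacement in action (the class and method names ReplaceDemo and readAll are invented here), using a byte that is never valid in UTF-8:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReplaceDemo {
    // Decodes bytes through a plain InputStreamReader, whose decoder
    // defaults to CodingErrorAction.REPLACE.
    static String readAll(byte[] input) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(input), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bad = { 'a', (byte) 0xFF, 'b' }; // 0xFF is never valid UTF-8
        // The invalid byte is silently replaced with U+FFFD:
        System.out.println(readAll(bad).equals("a\uFFFDb")); // prints true
    }
}
```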

Now, if you are concerned about the ability of "bad characters" to slip in, you can instead select a policy of REPORT. You can do that when reading a file too, as follows; this will have the effect of throwing a MalformedInputException on any malformed byte sequence:

// This is 2015. File is obsolete.
final Path path = Paths.get(...);
final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT);

try (
    final InputStream in = Files.newInputStream(path);
    final Reader reader = new InputStreamReader(in, decoder);
) {
    // use the reader
}

ONE EXCEPTION to that default replace action appears in Java 8: Files.newBufferedReader(somePath) will always read as UTF-8, and with a default action of REPORT.
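
A quick way to see that REPORT behavior, as an illustrative sketch (the class and method names ReportDemo and isMalformed are invented here): write a file containing a byte that is never valid UTF-8 and read it back through Files.newBufferedReader:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.MalformedInputException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReportDemo {
    // Returns true if reading the file as UTF-8 via Files.newBufferedReader
    // (which defaults to REPORT) raises a MalformedInputException.
    static boolean isMalformed(Path p) throws IOException {
        try (BufferedReader r = Files.newBufferedReader(p)) {
            while (r.read() != -1) { /* consume and discard */ }
            return false;
        } catch (MalformedInputException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bad", ".bin");
        Files.write(tmp, new byte[] { 'a', (byte) 0xFF, 'b' }); // 0xFF: never valid UTF-8
        System.out.println(isMalformed(tmp)); // prints true: reported, not replaced
        Files.delete(tmp);
    }
}
```

Contrast this with the InputStreamReader behavior above, where the same bytes would decode quietly with a replacement character.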
