Why is BufferedReader read() much slower than readLine()?


Problem description

I need to read a file one character at a time and I'm using the read() method from BufferedReader. *

I found that read() is about 10x slower than readLine(). Is this expected? Or am I doing something wrong?

Here's a benchmark with Java 7. The input test file has about 5 million lines and 254 million characters (~242 MB) **:

The read() method takes about 7000 ms to read all the characters:

@Test
public void testRead() throws IOException, UnindexableFastaFileException{

    BufferedReader fa= new BufferedReader(new FileReader(new File("chr1.fa")));

    long t0= System.currentTimeMillis();
    int c;
    while( (c = fa.read()) != -1 ){
        //
    }
    long t1= System.currentTimeMillis();
    System.err.println(t1-t0); // ~ 7000 ms

}

The readLine() method takes only ~700 ms:

@Test
public void testReadLine() throws IOException{

    BufferedReader fa= new BufferedReader(new FileReader(new File("chr1.fa")));

    String line;
    long t0= System.currentTimeMillis();
    while( (line = fa.readLine()) != null ){
        //
    }
    long t1= System.currentTimeMillis();
    System.err.println(t1-t0); // ~ 700 ms
}



* Practical purpose: I need to know the length of each line, including the newline characters (\n or \r\n), AND the line length after stripping them. I also need to know whether a line starts with the > character. For a given file this is done only once, at the start of the program. Since the EOL characters are not returned by BufferedReader.readLine(), I'm resorting to the read() method. If there are better ways of doing this, please say so.
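For context, here is a minimal sketch of what such a character-by-character scan might look like (the class and variable names are illustrative, not from the original post; it assumes, as stated above, that lines end in \n or \r\n):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineLengthScanner {

    // Scan the file once, recording for each line its length with and
    // without the EOL characters, plus whether it starts with '>'.
    public static void scan(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            int c;
            int lengthWithoutEol = 0;
            int eolLength = 0;
            boolean startsWithGt = false;
            boolean atLineStart = true;
            while ((c = reader.read()) != -1) {
                if (atLineStart) {
                    startsWithGt = (c == '>');
                    atLineStart = false;
                }
                if (c == '\r') {
                    eolLength++;              // part of the line terminator
                } else if (c == '\n') {
                    eolLength++;
                    // a complete line: report (or store) both lengths
                    System.out.println(lengthWithoutEol + "\t"
                            + (lengthWithoutEol + eolLength) + "\t" + startsWithGt);
                    lengthWithoutEol = 0;
                    eolLength = 0;
                    atLineStart = true;
                } else {
                    lengthWithoutEol++;
                }
            }
            // Note: a final line without a trailing newline would still need to be emitted here.
        }
    }
}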

** The gzipped file is here http://hgdownload.cse.ucsc.edu/goldenpath/hg19/chromosomes/chr1.fa.gz. For those who may be wondering, I'm writing a class to index fasta files.

Recommended answer

The important thing when analyzing performance is to have a valid benchmark before you start. So let's start with a simple JMH benchmark that shows what our expected performance after warmup would be.

One thing we have to consider is that modern operating systems like to cache file data that is accessed regularly, so we need some way to clear the caches between tests. On Windows there's a small utility that does just this; on Linux you should be able to do it by writing to some pseudo file somewhere.

The code looks like this:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

@BenchmarkMode(Mode.AverageTime)
@Fork(1)
public class IoPerformanceBenchmark {
    private static final String FILE_PATH = "test.fa";

    @Benchmark
    public int readTest() throws IOException, InterruptedException {
        clearFileCaches();
        int result = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
            int value;
            while ((value = reader.read()) != -1) {
                result += value;
            }
        }
        return result;
    }

    @Benchmark
    public int readLineTest() throws IOException, InterruptedException {
        clearFileCaches();
        int result = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result += line.chars().sum();
            }
        }
        return result;
    }

    private void clearFileCaches() throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("EmptyStandbyList.exe", "standbylist");
        pb.inheritIO();
        pb.start().waitFor();
    }
}
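The clearFileCaches() helper above shells out to a Windows-specific utility. A sketch of a Linux counterpart (an assumption on my part, not part of the original answer: it writes to the /proc/sys/vm/drop_caches pseudo file and therefore needs root privileges) could look like this:

// Linux variant: flush dirty pages, then ask the kernel to drop its caches.
// Writing "3" drops the page cache plus dentries and inodes.
private void clearFileCachesLinux() throws IOException, InterruptedException {
    new ProcessBuilder("sync").inheritIO().start().waitFor();
    ProcessBuilder pb = new ProcessBuilder("sh", "-c", "echo 3 > /proc/sys/vm/drop_caches");
    pb.inheritIO();
    pb.start().waitFor();
}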

If we now run the benchmark with

chcp 65001 # set codepage to utf-8
mvn clean install; java "-Dfile.encoding=UTF-8" -server -jar .\target\benchmarks.jar

we get the following results (about 2 seconds are needed to clear the caches for me and I'm running this on a HDD so that's why it's a good deal slower than for you):

Benchmark                            Mode  Cnt  Score   Error  Units
IoPerformanceBenchmark.readLineTest  avgt   20  3.749 ± 0.039   s/op
IoPerformanceBenchmark.readTest      avgt   20  3.745 ± 0.023   s/op

Surprise! As expected, there's no performance difference here at all once the JVM has settled into a stable mode. But there is one outlier in the readTest method:

# Warmup Iteration   1: 6.186 s/op
# Warmup Iteration   2: 3.744 s/op

which is exactly the problem you're seeing. The most likely reason I can think of is that OSR isn't doing a good job here, or that the JIT simply kicks in too late to make a difference on the first iteration.

Depending on your use case this might be a big problem or negligible (if you're reading a thousand files it won't matter, if you're only reading one this is a problem).

Solving such a problem is not easy and there is no general solution, although there are ways to handle it. One easy test to see whether we're on the right track is to run the code with the -Xcomp option, which forces HotSpot to compile every method on its first invocation. And indeed, doing so makes the large delay at the first invocation disappear:

# Warmup Iteration   1: 3.965 s/op
# Warmup Iteration   2: 3.753 s/op
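(For reference, this exact invocation is my assumption rather than part of the original answer: since the benchmarks run in forked JVMs, -Xcomp has to be passed through to them, for example via JMH's -jvmArgsAppend option.)

java -jar .\target\benchmarks.jar -jvmArgsAppend "-Xcomp"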

Possible solutions

Now that we have a good idea of what the actual problem is (my guess is still that all those locks are neither being coalesced nor using the efficient biased-locking implementation), the solution is rather straightforward and simple: reduce the number of function calls. (So yes, we could have arrived at this solution without everything above, but it's always nice to have a good grip on the problem, and there might have been a solution that didn't involve changing much code.)

The following code runs consistently faster than either of the other two. You can play with the array size, but it's surprisingly unimportant (presumably because, contrary to the other methods, read(char[]) does not have to acquire a lock for every character, so the cost per call is lower to begin with).

private static final int BUFFER_SIZE = 256;
private char[] arr = new char[BUFFER_SIZE];

@Benchmark
public int readArrayTest() throws IOException, InterruptedException {
    clearFileCaches();
    int result = 0;
    try (BufferedReader reader = new BufferedReader(new FileReader(FILE_PATH))) {
        int charsRead;
        // One read(char[]) call fills the whole buffer, so the per-call overhead
        // is paid once per buffer instead of once per character.
        while ((charsRead = reader.read(arr)) != -1) {
            for (int i = 0; i < charsRead; i++) {
                result += arr[i]; // consume the data so the JIT can't eliminate the read
            }
        }
    }
    return result;
}

This is most likely good enough performance-wise, but if you wanted to improve things even further, using a file mapping might help (I wouldn't count on too large an improvement in a case such as this, but if you know that your text is always ASCII you could make some further optimizations).
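A sketch of what such a file-mapping approach might look like (my own illustration, not code from the original answer; it maps the file with NIO and treats each byte as one ASCII character, so it is only valid for single-byte text):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedAsciiSum {

    // Sum all bytes of a file through a memory mapping; for pure ASCII text
    // one byte equals one character, so this matches the char-summing benchmarks.
    static long sumBytes(String path) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            long total = 0;
            long size = channel.size();
            long position = 0;
            // A single MappedByteBuffer is limited to ~2 GB, so map the file in chunks.
            while (position < size) {
                long chunk = Math.min(size - position, Integer.MAX_VALUE);
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, position, chunk);
                while (buffer.hasRemaining()) {
                    total += buffer.get();
                }
                position += chunk;
            }
            return total;
        }
    }
}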
