Fastest way to sum integers in text file

Question

Suppose you have a large ASCII text file, with a random non-negative integer on each line, each in the range from 0 to 1,000,000,000. There are 100,000,000 lines in the file. What's the fastest way to read through the file and calculate the sum of all the integers?

Constraint: we've got 10MB of RAM to work with. The file is 1GB in size, so we don't want to read the whole thing in and then process it.

Here are various solutions I've tried. I found the results rather surprising.

Is there anything faster that I've missed?

Please note: all timings given below are for running the algorithm 10 times in total (run once and discard; start timer; run 10 times; stop timer). The machine is a fairly slow Core 2 Duo.

Method 1: the natural approach

The first thing to try is the obvious approach:

private long sumLineByLine() throws NumberFormatException, IOException {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    long total = 0;
    while ((line = br.readLine()) != null) {
        int k = Integer.parseInt(line);
        total += k;
    }
    br.close();
    return total;
}

Note that the maximum possible return value is 10^17, which still easily fits in a long, so we don't have to worry about overflows.
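Spelling out that bound (a quick check of my own, not from the original post):

long maxPossibleSum = 100000000L * 1000000000L;  // 10^17: 100,000,000 lines x 1,000,000,000 max each
// Long.MAX_VALUE is 9223372036854775807 (about 9.2 x 10^18),
// so the sum fits in a long with roughly two orders of magnitude to spare.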

On my machine, running this 11 times and discounting the first run takes around 92.9 seconds.

Method 2: a minor tweak

Inspired by a comment on this question, I tried not creating a new int k to store the result of parsing the line, and instead just to add the parsed value directly to total. So this:

    while ((line = br.readLine()) != null) {
        int k = Integer.parseInt(line);
        total += k;
    }

becomes this:

    while ((line = br.readLine()) != null)
        total += Integer.parseInt(line);

I was certain that this wouldn't make any difference, and thought it highly likely that the compiler would generate the same bytecode for the two versions. But, to my surprise, it did shave a little time off: we're down to 92.1 seconds.

Method 3: manually parsing the integer

One thing that bothers me about the code so far is that we turn the String into an int, and then add it on at the end. Might it not be quicker to add on as we go? What happens if we parse the String ourselves? Something like this...

private long sumLineByLineManualParse() throws NumberFormatException,
        IOException {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    long total = 0;
    while ((line = br.readLine()) != null) {
        char chs[] = line.toCharArray();
        int mul = 1;
        for (int i = chs.length - 1; i >= 0; i--) {
            char c = chs[i];
            switch (c) {
            case '0':
                break;
            case '1':
                total += mul;
                break;
            case '2':
                total += (mul << 1);
                break;
            case '4':
                total += (mul << 2);
                break;
            case '8':
                total += (mul << 3);
                break;
            default:
                total += (mul*((byte) c - (byte) ('0')));   
            }
            mul*=10;
        }
    }
    br.close();
    return total;
}

This, I thought, might save a little time, especially with some bitshift optimisations for doing the multiplication. But the overheads of converting to a character array must swamp any gains: this now takes 148.2 seconds.

Method 4: processing in binary

One last thing we can try is to process the file as binary data.

Parsing an integer from the front is awkward if you don't know the length of it. Parsing it backwards is much easier: the first digit you encounter is units, the next one is tens, and so on. So the easiest way to approach the whole thing is to read the file backwards.

If we allocate a byte[] buffer of (say) 8MB, we can fill it up with the last 8MB of the file, process it, then read the preceding 8MB, and so on. We need to be a little careful that we don't screw up a number that we're in the middle of parsing when we move to the next block, but that's the only problem.

When we encounter a digit, we add it (suitably multiplied according to its position in the numeral) to the total, and then multiply the coefficient by 10 so we're ready for the next digit. If we encounter anything that isn't a digit (a CR or LF), we just reset the coefficient.

private long sumBinary() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int lastRead = (int) raf.length();
    byte buf[] = new byte[8*1024*1024];
    int mul = 1;
    long total = 0;
    while (lastRead>0) {
        int len = Math.min(buf.length, lastRead);
        raf.seek(lastRead-len);
        raf.readFully(buf, 0, len);
        lastRead-=len;
        for (int i=len-1; i>=0; i--) {
            //48 is '0' and 57 is '9'
            if ((buf[i]>=48) && (buf[i]<=57)) {
                total+=mul*(buf[i]-48);
                mul*=10;
            } else
                mul=1;
        }
    }
    raf.close();
    return total;
}

This runs in 30.8 seconds! That's a speed increase by a factor of 3 over the previous best.

Follow-up questions

  1. Why is this so much faster? I was expecting it to win, but not quite so impressively. Is it mainly the overheads of converting to a String? And all the worrying behind the scenes about character sets and the like?
  2. Can we do any better than this by using a MappedByteBuffer to help? I have a feeling that the overheads of invoking methods to read from the buffer would slow things down, especially when reading backwards from the buffer.
  3. Would it be better to read the file forwards rather than backwards, but still scan the buffer backwards? The idea would be that you read the first chunk of the file, and then scan backwards, but discarding the half-number at the end. Then when you read the next chunk, you set the offset so that you read from the beginning of the number you discarded. (See the sketch just after this list.)
  4. Is there anything I haven't thought of that could make a significant difference?
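Regarding question 3, here is a rough, untested sketch of that idea (my own addition, not from the original post). It reads the file forwards in buffer-sized chunks, scans each chunk backwards, discards the possibly-truncated number at the end of the chunk, and starts the next read at the position of the discarded number. It assumes numbers are separated by line terminators and are far shorter than the buffer:

private long sumBinaryForwardReadBackwardScan() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    byte buf[] = new byte[16 * 1024];
    long fileLength = raf.length();
    long pos = 0;       // always the offset of the start of a number (or the file start)
    long total = 0;
    while (pos < fileLength) {
        int len = (int) Math.min(buf.length, fileLength - pos);
        raf.seek(pos);
        raf.readFully(buf, 0, len);
        int end = len - 1;
        if (pos + len < fileLength) {
            // not the final chunk: back up past the (possibly partial) trailing number
            // to the last line terminator, and leave those digits for the next read
            while (end >= 0 && buf[end] >= 48 && buf[end] <= 57)
                end--;
        }
        int mul = 1;
        for (int i = end; i >= 0; i--) {
            if ((buf[i] >= 48) && (buf[i] <= 57)) {
                total += mul * (buf[i] - 48);
                mul *= 10;
            } else
                mul = 1;
        }
        pos += end + 1;  // re-read the discarded partial number at the start of the next chunk
    }
    raf.close();
    return total;
}

Because each chunk starts at a number boundary and (except for the last chunk) ends at a line terminator, the multiplier can simply be reset to 1 at the start of every chunk.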

Update: more surprising results

First, an observation. It should have occurred to me before, but I think the reason for the inefficiency of the String-based reading is not so much the time taken to create all the String objects but the fact that they are so short-lived: we've got 100,000,000 of them for the garbage collector to deal with. That is bound to upset it.

Now some experiments based on answers/comments people have posted.

Am I cheating with the size of the buffer?

One suggestion was that since a BufferedReader uses a default buffer of 16KB, and I've used a buffer of 8MB, I'm not comparing like with like. It's bound to be faster if you use a bigger buffer.

Here's the shock. The sumBinary() method (Method 4) ran in 30.8 seconds yesterday with an 8MB buffer. Today, code unchanged, the wind direction has changed and we're at 30.4 seconds. If I drop the buffer size down to 16KB to see how much slower it gets, it gets faster! It now runs in 23.7 seconds. Crazy. Who saw that one coming?!

A bit of experimentation suggests that 16KB is about optimal. Perhaps the Java guys did the same experiments, and that's why they went with 16KB!

Is the problem I/O-bound?

I wondered about this too. How much time is spent on disk access, and how much on number crunching? If it's almost all disk access, as suggested by a well-supported comment on one of the proposed answers, then we won't be able to make much improvement whatever we do.

This is easy to test by running the code with all the parsing and number crunching commented out, but with the reading still intact:

private long sumBinary() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int lastRead = (int) raf.length();
    byte buf[] = new byte[16 * 1024];
    int mul = 1;
    long total = 0;
    while (lastRead > 0) {
        int len = Math.min(buf.length, lastRead);
        raf.seek(lastRead - len);
        raf.readFully(buf, 0, len);
        lastRead -= len;
        /*for (int i = len - 1; i >= 0; i--) {
            if ((buf[i] >= 48) && (buf[i] <= 57)) {
                total += mul * (buf[i] - 48);
                mul *= 10;
            } else
                mul = 1;
        }*/
    }
    raf.close();
    return total;
}

This now runs in 3.7 seconds! This doesn't look I/O-bound to me.

Of course, some of the I/O speed will come from disk cache hits. But that isn't really the point here: we're still taking 20 seconds of CPU time (also confirmed using Linux's time command), which is plenty big enough to try to reduce it.

Scanning forwards instead of backwards

I'd maintained in my original post that there was good reason to scan the file backwards rather than forwards. I didn't explain that very well. The idea was that if you scan a number forwards, you have to accumulate the total value of the scanned number, and then add it on. If you scan backwards, you can add it to the cumulative total as you go. My subconscious was making some sort of sense to itself (on which more later), but I'd missed one key point, which was pointed out in one of the answers: to scan backwards, I was doing two multiplications per iteration, but with scanning forwards you need only one. So I coded up a forward-scanning version:

private long sumBinaryForward() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int fileLength = (int) raf.length();
    byte buf[] = new byte[16 * 1024];
    int acc = 0;
    long total = 0;
    int read = 0;
    while (read < fileLength) {
        int len = Math.min(buf.length, fileLength - read);
        raf.readFully(buf, 0, len);
        read += len;
        for (int i = 0; i < len; i++) {
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                total += acc;
                acc = 0;
            }
        }
    }
    raf.close();
    return total;
}

This runs in 20.0 seconds, beating the backward-scanning version by a distance. Nice.

Multiplication cache

What I realised during the night, though, was that although I was performing two multiplications per iteration, there was the possibility of using a cache to store these multiplications, so that I could avoid having to perform them during backwards iteration. I was pleased to see when I woke up that someone had had the same idea!

The point is that there are at most 10 digits in the numbers we're scanning, and only 10 possible digits, so only 100 possibilities for the value of a digit to the cumulative total. We can precompute these, and then use them in the backward-scanning code. That ought to beat the forward-scanning version, because we've now got rid of the multiplications entirely. (Note that we can't do this with forward scanning, because the multiplication is of the accumulator, which could take any value up to 10^9. It's only in the backward case that both operands are limited to a few possibilities.)

private long sumBinaryCached() throws IOException {
    int mulCache[][] = new int[10][10];
    int coeff = 1;
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++)
            mulCache[i][j] = coeff * j;
        coeff *= 10;
    }

    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int lastRead = (int) raf.length();
    byte buf[] = new byte[16 * 1024];
    int mul = 0;
    long total = 0;
    while (lastRead > 0) {
        int len = Math.min(buf.length, lastRead);
        raf.seek(lastRead - len);
        raf.readFully(buf, 0, len);
        lastRead -= len;
        for (int i = len - 1; i >= 0; i--) {
            if ((buf[i] >= 48) && (buf[i] <= 57))
                total += mulCache[mul++][buf[i] - 48];
            else
                mul = 0;
        }
    }
    raf.close();
    return total;
}

This runs in 26.1 seconds. Disappointing, to say the least. Reading backwards is less efficient in terms of I/O, but we've seen that I/O is not the major headache here. I had expected this to make a big positive difference. Perhaps the array lookup is just as expensive as the multiplications we've replaced. (I did try making the array 16x16, and using bitshifts to index, but it didn't help.)

Looks like forward scanning is where it's at.

Using a MappedByteBuffer

Next thing to add in is a MappedByteBuffer, to see if that's more efficient than using a raw RandomAccessFile. It doesn't need much change to the code.

private long sumBinaryForwardMap() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    byte buf[] = new byte[16 * 1024];
    final FileChannel ch = raf.getChannel();
    int fileLength = (int) ch.size();
    final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0,
            fileLength);
    int acc = 0;
    long total = 0;
    while (mb.hasRemaining()) {
        int len = Math.min(mb.remaining(), buf.length);
        mb.get(buf, 0, len);
        for (int i = 0; i < len; i++)
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                total += acc;
                acc = 0;
            }
    }
    ch.close();
    raf.close();
    return total;
}

This does seem to improve things a little: we're now at 19.0 seconds. We've taken another second off our personal best!

What about multi-threading?

One of the proposed answers involves using multiple cores. I'm a little ashamed that that hadn't occurred to me!

The answer came in for some stick, because of the assumption that it's an I/O-bound problem. This seems a little harsh, in light of the results about I/O! Certainly worth a try, in any case.

We'll do this using fork/join. Here's a class to represent the result of a computation on part of the file, bearing in mind that there might be a partial result to the left (if we started half way through a number), and a partial result to the right (if the buffer finished half way through a number). The class also has a method for allowing us to glue two such results together, into a combined result for two adjacent sub-tasks.

private class SumTaskResult {
    long subtotal;
    int leftPartial;
    int leftMulCount;
    int rightPartial;

    public void append(SumTaskResult rightward) {
        subtotal += rightward.subtotal + rightPartial
                * rightward.leftMulCount + rightward.leftPartial;
        rightPartial = rightward.rightPartial;
    }
}

Now the key bit: the RecursiveTask that computes the result. For small problems (less than 64 characters), it calls computeDirectly() to calculate the result in a single thread; for larger problems, it splits into two, solves the two sub-problems in separate threads, and then combines the results.

private class SumForkTask extends RecursiveTask<SumTaskResult> {

    private byte buf[];
    // startPos inclusive, endPos exclusive
    private int startPos;
    private int endPos;

    public SumForkTask(byte buf[], int startPos, int endPos) {
        this.buf = buf;
        this.startPos = startPos;
        this.endPos = endPos;
    }

    private SumTaskResult computeDirectly() {
        SumTaskResult result = new SumTaskResult();
        int pos = startPos;

        result.leftMulCount = 1;

        while ((buf[pos] >= 48) && (buf[pos] <= 57)) {
            result.leftPartial = result.leftPartial * 10 + buf[pos] - 48;
            result.leftMulCount *= 10;
            pos++;
        }

        int acc = 0;
        for (int i = pos; i < endPos; i++)
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                result.subtotal += acc;
                acc = 0;
            }

        result.rightPartial = acc;
        return result;
    }

    @Override
    protected SumTaskResult compute() {
        if (endPos - startPos < 64)
            return computeDirectly();
        int mid = (endPos + startPos) / 2;
        SumForkTask left = new SumForkTask(buf, startPos, mid);
        left.fork();
        SumForkTask right = new SumForkTask(buf, mid, endPos);
        SumTaskResult rRes = right.compute();
        SumTaskResult lRes = left.join();
        lRes.append(rRes);
        return lRes;
    }

}

Note that this is operating on a byte[], rather than the whole MappedByteBuffer. The reason for that is that we want to keep the disk access sequential. We'll take quite large chunks, fork/join, and then move to the next chunk.

Here's the method that does that. Note that we've pushed the buffer size up to 1MB (sub-optimal earlier, but more sensible here, it seems).

private long sumBinaryForwardMapForked() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    ForkJoinPool pool = new ForkJoinPool();

    byte buf[] = new byte[1 * 1024 * 1024];
    final FileChannel ch = raf.getChannel();
    int fileLength = (int) ch.size();
    final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0,
            fileLength);
    SumTaskResult result = new SumTaskResult();
    while (mb.hasRemaining()) {
        int len = Math.min(mb.remaining(), buf.length);
        mb.get(buf, 0, len);
        SumForkTask task = new SumForkTask(buf, 0, len);
        result.append(pool.invoke(task));
    }
    ch.close();
    raf.close();
    pool.shutdown();
    return result.subtotal;
}

Now here's the soul-destroying disappointment: this nicely multi-threaded code now takes 32.2 seconds. Why so slow? I spent quite a while debugging this, assuming I'd done something terribly wrong.

Turns out there was just one small tweak needed. I'd thought the threshold of 64 between small problem and big problem was a reasonable one; turns out that was totally ridiculous.

Think about it like this. The sub-problems are exactly the same size, so they should complete in pretty much the same time. So there's really no point splitting into more pieces than there are processors available. On the machine I'm using, with only two cores, going down to a threshold of 64 is ridiculous: it just adds more overhead.

Now you don't want to limit things so that it only uses two cores even when there are more available. Perhaps the right thing to do would be to find out the number of processors at runtime, and split into that many pieces.
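As a sketch of that last idea (my own addition, not from the original post), the direct-computation threshold could be derived from the chunk size and the core count reported by the runtime; with the 1MB buffer used above and two cores, that works out to the 512KB figure tried next:

int cores = Runtime.getRuntime().availableProcessors();
int chunkSize = 1 * 1024 * 1024;                       // the 1MB buffer used above
int directThreshold = Math.max(64, chunkSize / cores); // about 512KB on a dual-core machine

// SumForkTask.compute() would then test against this value instead of the hard-coded 64,
// e.g. if (endPos - startPos < directThreshold) return computeDirectly();
// (the threshold would have to be passed into the task or stored in a field,
// which the class above doesn't currently do)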

In any case, if I change the threshold to 512KB (half the buffer size), it now completes in 13.3 seconds. Going down to 128KB or 64KB would allow more cores to be used (up to 8 or 16 respectively), and doesn't significantly affect the runtime.

So multi-threading does make a big difference.

It's been quite a long journey, but we started out with something that took 92.9 seconds and we're now down to 13.3 seconds... that's seven times the speed of the original code. And that's not by improving the asymptotic (big-Oh) time complexity, which was linear (optimal) right from the start... this has all been about improving the constant factor.

A good day's work.

I suppose I should probably try using the GPU next...

Postscript: generating the file of random numbers

I generated the random numbers with the following code, which I ran and redirected to a file. Obviously I can't guarantee that you'll end up with exactly the same random numbers that I had :)

public static void genRandoms() {
    Random r = new Random();
    for (int i = 0; i < 100000000; i++)
        System.out.println(r.nextInt(1000000000));
}

Solution

I think there is another way of doing this.

This is a classic multi-process programming problem. In C there is the MPI library, which solves this kind of problem.

The idea is to chunk the list of integers into, say, 4 parts, with each part summed by a different process. When they finish, the partial sums are added together.

In Java this could be done with threads (pseudo-parallel) and Java concurrency.

E.g. 4 different threads summing 4 different parts of the list; at the end, their results are added together.
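A minimal sketch of that split-and-combine pattern (my own illustration, not part of the original answer): it sums an in-memory array with an ExecutorService and four Callables, ignoring the question's 10MB constraint, since the point is only to show the pattern being described.

import java.util.*;
import java.util.concurrent.*;

public static long parallelSum(final int[] values) throws Exception {
    final int parts = 4;
    ExecutorService pool = Executors.newFixedThreadPool(parts);
    List<Future<Long>> futures = new ArrayList<Future<Long>>();
    int sliceLen = (values.length + parts - 1) / parts;
    for (int p = 0; p < parts; p++) {
        final int from = p * sliceLen;
        final int to = Math.min(values.length, from + sliceLen);
        futures.add(pool.submit(new Callable<Long>() {
            public Long call() {
                long sum = 0;
                for (int i = from; i < to; i++)   // sum this thread's slice
                    sum += values[i];
                return sum;
            }
        }));
    }
    long total = 0;
    for (Future<Long> f : futures)
        total += f.get();                         // combine the partial sums
    pool.shutdown();
    return total;
}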

Telephone companies use grid computers that apply this kind of parallel programming technique to total their transactions.

The only problem here (the bottleneck) is the I/O. Reading the file will take a long time. If somehow you could make multiple threads read different parts of the file... That is a very complicated approach, and I think it would not do much good, because the disk won't spin any faster just because many threads are using it, but there are other techniques for doing similar things. You can read more about this here: Access File through multiple threads, and here: Reading a single file with Multiple Thread: should speed up?
